Commit Graph

3220 Commits

Author SHA1 Message Date
juuso-oskari
614afea7eb CK-UA: derive kBlockQ at runtime, decouple from variant template
kBlockQ (= kBlockM / num_queries_per_kv) was constexpr in
`UnifiedAttentionShape` / the kernel-traits, forcing one kernel
instance per (kBlockM, num_qpkv) pair even though the matmul tile is
fully determined by kBlockM and kHeadDim. Audit confirmed kBlockQ
only feeds:

  * arithmetic in `unified_attention_kernel.hpp` (loop bounds, Q-tile
    indexing, query_len padding),
  * `pad_tensor_view` size tuples for Q/O/LSE DRAM views,
  * one `mask.IsEdgeTile(... number<kBlockQ>{} ...)` call inside the
    pipeline's per-K-tile mask check.

None of these structurally need a compile-time value:

* `pad_tensor_view` already accepts mixed runtime/compile-time tuple
  elements (e.g. it's passed plain `1` next to `kHeadDimPadded`).
* `IsEdgeTile` only does runtime arithmetic on the tile size; adding a
  runtime overload that accepts `index_t` is trivial (the compile-time
  one now forwards to it).

Wiring:
  * `block_masking.hpp` -- add an `IsEdgeTile(..., index_t tile_h,
    index_t tile_w)` overload; the existing `number<>` overload just
    forwards to it.
  * `unified_attention_pipeline.hpp` -- new optional
    `num_queries_per_kv` arg on the pipeline's `operator()` (default 0
    keeps existing call sites unchanged). Computes
    `kBlockQ_dyn = (num_qpkv > 0) ? (kBlockM / num_qpkv) : kBlockQ`
    once at the top, uses it in the IsEdgeTile call.
  * `unified_attention_kernel.hpp` -- compute
    `const index_t kBlockQ_dyn = kBlockM / kargs.num_queries_per_kv`
    once and replace every per-call `kBlockQ` use with `kBlockQ_dyn`.
    Pass `kargs.num_queries_per_kv` through to the pipeline. The
    debug-only assert(`kBlockQ_dyn == kBlockQ`) keeps the static and
    dynamic values in lock-step until we actually collapse variants.

Perf A/B (b=4..256, sk=120000, MI300):

  d=128 MHA (num_qpkv = 1, runtime div is trivial):
    BW within +/-0.2% across all batch sizes (noise).

  d=64 GQA-8 (num_qpkv = 8, runtime division actually happens):
    speedups 1.28x..2.14x vs Triton -- identical to baseline.

Correctness suite stays at 241/245 (same 4 pre-existing int32-overflow
failures in the d=128 prefill rebased-pointer path).

This is a no-op on perf and unlocks a follow-up where we collapse the
two num_qpkv values per (head_dim, kBlockM) -- e.g. the future d=128
GQA-8 variant can reuse the existing decode_d128_mha_* instances by
just passing a different runtime num_queries_per_kv.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 12:01:59 +00:00
juuso-oskari
f5beedb2e9 Add CK-UA decode_d128_mha_m32 / _m16 small-Q tiers
For pure-decode workloads (sq=1) at d=128 the m128 tile wastes most of
its 128 query rows, capping CK below Triton on every batch size in our
sweep (4..256). Add two small-Q tiers that mirror the d=64 GQA-8 ladder:

  * decode_d128_mha_m16 : kBlockM=16, 1 warp, 16x16 MFMA  (tiny-decode)
  * decode_d128_mha_m32 : kBlockM=32, 1 warp, 32x32 MFMA  (tiny-decode)

select_config now ladders by (avg_q, max_q): m16 -> m32 -> m128 -> prefill.

d=128 MHA, hq=16/hk=16, sq=1, sk=120k, num_blocks=60k:
  batch  before    after    CK BW
      4  ~0.95x   0.98x   4.76 TB/s
      8  ~0.85x   1.29x   5.00 TB/s
     32  ~0.85x   1.14x   5.29 TB/s
     64  ~0.75x   0.93x   5.35 TB/s
    128  ~1.00x   1.09x   5.39 TB/s
    256  ~1.03x   1.02x   5.41 TB/s

Correctness suite stays at 241/245 (same 4 known int32-overflow
failures in the prefill path).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 11:48:19 +00:00
juuso-oskari
fb0d729fbb Collapse CK-UA traits into single kernel_traits<V, DType, IsMask> template
Replace 4 near-identical *_kernel_traits classes (~400 lines of repeated
shape/policy plumbing) with one templated `unified_attention_kernel_traits`
parameterized by `KernelVariant V`. The 6 dispatch_<variant> helpers in
unified_attention.cpp collapse into a single `dispatch_variant<V>` function
template that fans out over (dtype, mask).

Per-variant compile-time knobs (BlockM, BlockSize, warp count, MFMA shape,
pipeline policy, decode-grid flag) now live in one variant_config<V>
specialization each. "What's different between variants" is readable
top-to-bottom in a single block of code, and each instance .cpp shrinks to
a one-line `INST_UNIFIED_ATTENTION_DISPATCH(V, DTYPE, IS_MASK)` macro.

No behavior change. Correctness suite: 236/240 (same 4 known
num_blocks=32768 + d=128 MHA int32-overflow failures as baseline).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 10:35:15 +00:00
juuso-oskari
5bd8f73a28 Delete CK-UA bs32 variant family
The bs32 variants existed because pre-fix the pipeline required
kBlockN <= page_size, so page_size=32 forced a kBlockN=32 kernel
family. The multi-page-tile fix (commit 473869aba) lifted that
constraint and made kBlockN compile-time-independent of the runtime
page size, so the bs32 family is now redundant: every non-bs32 variant
is correct for any page_size.

This was validated in advance by forcing use_bs32=false in the
dispatcher and running the full correctness suite -- 236/240, identical
to baseline (the 4 remaining failures are the pre-existing int32-
overflow case, orthogonal).

Removes:
  * 16 instances/unified_attention_*_bs32_*.cpp files
  * unified_attention_decode_bs32_kernel_traits in unified_attention_impl.hpp
  * 3 _BS32 dispatch macros in unified_attention.cpp
  * 3 _p32 entries from the KernelVariant enum
  * 3 dispatch_*_p32 helper functions and their switch cases
  * the page_blk_size branch in select_config (now a pure tile-tier ladder)

Net: 12 fewer compile units (build time -6s on JIT), 78 fewer dispatcher
lines, and "which kernel runs?" is now driven purely by Q-tile shape.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 09:41:41 +00:00
juuso-oskari
fddb0d21cd Add d=128 MHA decode variant (decode_d128_mha_m128)
Until now every d=128 MHA workload took the 8-warp prefill kernel
(kBlockM=256, kBlockQ=256), wasting 255/256 Q rows on pure-decode
shapes where Q is 1. Add a dedicated 4-warp decode variant with
kBlockM=128 (kBlockQ=128) that cuts the Q-tile waste roughly in half.

  * Four new instance files at instances/unified_attention_d128_*_decode.cpp,
    each instantiating unified_attention_decode_kernel_traits<dt, mask, 128, 128, 1>.
  * KernelVariant::decode_d128_mha_m128 wired into select_config: chosen
    when both avg_q and max_seqlen_q fit in 128, else fall back to prefill.

Tests: ua-test-scripts/test_unified_attention_ck_correctness.py stays at
236/240 -- the pure-decode seq_lens pattern in head_config=(16,16,128)
now routes to the new variant and matches the torch reference. The 4
remaining failures are the pre-existing int32-overflow case (orthogonal).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 09:34:52 +00:00
juuso-oskari
3ab4df37e2 Refactor CK-UA dispatcher around KernelVariant + select_config
The previous dispatcher was a 4-deep nested-if cascade that picked one
of seven DISPATCH_* macros based on (hdim, num_queries_per_kv, dtype,
mask, tile_tier, use_bs32). The macro names hid both the traits class
and the dispatch path, so reasoning about "what kernel runs for shape
X" required reading the whole file.

Replace it with two named layers:

  1. KernelVariant enum -- a flat list of every compiled instance.
  2. select_config(args) -- the only place runtime decisions live;
     reads the problem and emits a KernelConfig{variant, ...}.

The final switch over the variant calls into per-variant dispatch
helpers that fan out over (dtype, mask) via the existing DISPATCH_*
macros. Behaviour is unchanged: each old (hdim, nqpkv, tier, p32) tuple
maps 1:1 to a KernelVariant, and the same instance is launched.

Follow-up commits in this series will:
  - add a dedicated d=128 MHA decode variant
  - delete the _p32 ("bs32") family now that the multi-page-tile fix
    in the pipeline makes kBlockN independent of page_size

Test: ua-test-scripts/test_unified_attention_ck_correctness.py
      stays at 236/240 (same 4 pre-existing int32-overflow failures).
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 09:27:59 +00:00
juuso-oskari
25364aa634 Add KV-segment parallelism to CK unified attention pipeline
End-to-end split-KV (FlashDecoding-style) for the CK unified attention
kernel. The host launches a single 3D grid with z == num_splits; each
CTA computes its KV-range slice and writes a normalized (o_acc, lse)
partial to FP32 workspaces, which the caller reduces into the final
output.

Pipeline changes:
- operator() returns ck_tile::make_tuple(o_acc, lse) instead of just
  o_acc. The masked-empty early-exit returns lse = -inf so downstream
  combine weighs the partial as zero.
- LSE is built in the natural-log domain from the pipeline's *unscaled*
  rowmax: lse = (scale_s / log2(e)) * m + log(l). Previously we used
  m / log2(e) + log(l), which dropped the per-head scale and produced
  LSE values ~1/scale too large.
- Fix post-process parity: which SP register is left in the
  alu0-done/alu1-pending state at loop exit depends on the parity of
  the *iteration count* (= num_total_loop - num_blocks_start), not on
  num_total_loop alone. For non-split (num_blocks_start == 0) the two
  parities coincide; for splits starting at an odd tile they don't.
- Fix split-KV page-table offset: num_blocks_start is counted in
  kPageBlockSize-sized tiles, but block_tables is indexed in
  page_size-sized pages — shifting block_table_offset by num_blocks_start
  reads the wrong pages whenever kPageBlockSize != page_size. Replaced
  with split_token_offset = num_blocks_start * kPageBlockSize added to
  logical_token before /page_size, so the page lookup uses the absolute
  token position.

Kernel + dispatcher:
- Drop kargs.i_split; each CTA reads i_split = blockIdx.z.
- GridSize{2D,Decode} now take num_splits and add it as the z-dim
  (defaults to 1, so non-split callers see dim3(..., 1, 1)).
- New write path: when num_splits > 1, the kernel skips the user
  epilogue and instead writes the FP32 (o_acc, lse) tile pair into
  workspace tensors at [head, split, batch_start_token, ...] using
  Default2DEpilogue (UseRawStore=true) for o_acc and store_tile for
  lse. Host strides come from kargs.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 08:42:09 +00:00
juuso-oskari
473869aba5 Lift kPageBlockSize <= page_size constraint in CK-UA pipeline
Refactor the K/V DRAM access in the unified-attention pipeline to use
tile_scatter_gather with a unified per-(thread, Y0-iter) page-offset
formula:

    logical_token = tile_idx * kPageBlockSize + thread_N_pos + i * Y0_step_N
    logical_page  = logical_token / page_size
    within_page   = logical_token % page_size
    phys_page     = block_tables[block_table_offset + logical_page]
    page_offsets[i] = (phys_page * page_size + within_page) * row_stride

The page indirection now lives entirely in page_offsets, refreshed via
update_page_idx() between iters. The per-iter SRD rebase
(set_bottom_tensor_view_data_ptr + init_raw) and the use_ptr_rebase
overflow heuristic are gone.

Effects:
 - The assertion kv_page_size_in_blocks >= 1 (i.e. kPageBlockSize <=
   page_size) in the kernel is dropped. Tiles may now span multiple
   cache pages, as long as Y0_step_N (= N1*N2 from the K/V tile dist)
   divides page_size so that a wave-wide load never straddles a page.
 - Pipeline arg renamed kv_page_size_in_blocks -> page_size (PageSize
   in tokens). Kernel passes kargs.page_size through directly.
 - Validated correctness vs Triton on bf16 / d=64 / decode_s with
   block_size in {16, 32, 64}; max abs diff 1.22e-04 in all cases.
   Perf is on par with the prior pass-1 scaffolding (~3.6 ms on the
   131072-context shape).

TODO(overflow): page_offsets are index_t; caches whose
num_blocks * page_size * row_stride exceeds INT32_MAX will wrap.
A future change should plumb long_index_t through the scatter-gather
load path or compute a per-batch min-page shift in a pre-pass.

TODO(unsupported regime): page_size < Y0_step_N (a wave crosses a page
mid-iter) needs per-lane VGPR SRDs and is not implemented.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-11 10:04:01 +00:00
root
8506db8761 Fix int32 overflow in CK-UA pipeline via pointer rebasing
tensor_coordinate::get_offset() returns index_t (int32), causing overflow
when page_idx * block_size * stride > 2^31 (~131K blocks for d64/GQA-8).

Fix: rebase K/V data pointer for each page using int64 arithmetic instead
of set_window_origin with large offsets. After rebasing p_data_ and
buffer_size_, call init_raw() to refresh the AMD buffer resource descriptor,
then set_window_origin({0,0}) to reset cached coordinates.

Tested: num_blocks up to 2M with nkh=1/8, blk=32/64. All pass.
Made-with: Cursor
2026-04-02 09:39:07 +00:00
root
e8587b86c2 Fix CK-UA pipeline: s_waitcnt_vmcnt<0> in fmha_post_process
The final V tile's async load was not properly waited on before reading
from LDS: s_waitcnt_vmcnt<K_inst> allowed V_inst outstanding loads
(a no-op when K_inst == V_inst). The last loop iteration never prefetches
K, so only V is outstanding. Use s_waitcnt_vmcnt<0> unconditionally.

This partially fixes the BS32 race condition for production workloads
(maxk >= 256). A deeper pipeline race remains for very short KV
sequences (maxk < ~165, 2-5 pages) with block_size=32 at high batch.

Made-with: Cursor
2026-04-01 23:04:07 +00:00
root
87d16738bf WIP: CK-UA KV-segment parallelism - kernel args and split range
Added split-KV fields to UnifiedAttentionVarlenKargs (num_splits,
i_split, lse_acc_ptr, o_acc_ptr + strides). Modified operator() to
compute per-split KV range using blocks_per_split.

INCOMPLETE: The pipeline returns normalized o_acc but the split-KV
combine kernel needs unnormalized o_acc + lse. Need to modify the
pipeline to optionally return m and l values alongside o_acc.

The kernel changes compile but the epilogue needs the split path
(write to float accumulators instead of final output).

Made-with: Cursor
2026-04-01 19:09:59 +00:00
root
63821af1ff Add split-KV decode tiles (b16x32, b32x32) + fix num_splits heuristic
Decode tiles for split-KV hdim=64: bm0=16/1-warp and bm0=32/2-warp.
Fix num_splits to use num_heads_kv (not num_heads_q) and target 4x SMs.

Performance unchanged (0.056ms) because:
1. Split+combine overhead dominates for short KV (31 pages)
2. Triton 3D's single-kernel split avoids combine kernel entirely

Made-with: Cursor
2026-04-01 18:49:16 +00:00
root
c5600bc8ae Add decode tiles (b16x32, b32x32) to pagedkv_prefill codegen with max_seqlen_q dispatch
Made-with: Cursor
2026-04-01 18:30:06 +00:00
root
65a3f88ad8 Fix CK-UA mixed batch: use max_seqlen_q for tier selection
Decode grid (num_kv_heads, num_seqs) assumes each seq has <= kBlockQ
tokens. For mixed batches (decode + prefill), avg_q is low but some
seqs have hundreds of tokens, causing truncation. Added max_seqlen_q
to args and check it in select_tile_tier to force medium tier (1D
grid with Q tile iteration) for mixed batches.

362/362 no-window shapes now pass.

Made-with: Cursor
2026-04-01 18:09:48 +00:00
root
07ba03bcbf Fix sliding window mask: use window_generic when left >= 0
mask_info::decode('b:left,right,sink') always created mask_bottom_right
(IsLocal=false) which ignores the left window boundary. For sliding
window attention (left >= 0), use window_generic (IsLocal=true) so the
kernel respects the left boundary.

Fixes: CK split-KV producing identical results with and without sliding
window. Now 724/724 shapes pass correctness vs Triton.

Made-with: Cursor
2026-04-01 18:00:19 +00:00
root
e5272603c9 Wire FmhaFwdPagedKV: enable bf16 hdim=64 with bn0=32 for page_block_size=32
Made-with: Cursor
2026-04-01 17:18:41 +00:00
root
10564b0c40 Enable FmhaFwdPagedKV bf16 hdim=64 instances (was commented out)
Made-with: Cursor
2026-04-01 16:49:20 +00:00
root
cd7ba6e2e8 Add unified attention (42_unified_attention)
Squashed from aghamari/unified-attention-decode-opt branch.

CK tile paged-KV attention kernel optimized for decode with 4-tier
dispatch (tiny/small/medium/large), 16x16 MFMA, 2D decode grid,
head-group merging. Supports hdim=64 GQA-8 and hdim=128 MHA with
block_size=32.

Made-with: Cursor
2026-04-01 16:39:15 +00:00
root
ec2db01e4a Fix fmha_fwd early-exit bug: seqlen_q <= min_seqlen_q should be <
The kSkipMinSeqlenQ optimization incorrectly used <= comparison, causing
the kernel to skip batches where seqlen_q equals min_seqlen_q. This
happens in the common case of no padding (all batches have the same
seqlen_q == min_seqlen_q), producing all-zero output silently.

Changed to strict < so batches with exactly min_seqlen_q tokens are
still processed.

Made-with: Cursor
2026-04-01 16:24:31 +00:00
root
cb6fb2802d Split-KV codegen: dual-tile dispatch and head-merge for hdim=64
1. Dual-tile: add both bn0=64 (preferred) and bn0=32 (fallback) for
   hdim=64 on gfx9 and gfx12. The dispatch checks page_block_size %
   bn0 == 0 at runtime to select the optimal tile. bn0=64 halves KV
   iterations when page_block_size >= 64.

2. Tile dict now supports lists per hdim. The codegen loop iterates
   over all tile variants, generating separate kernel instances for
   each. Combine kernels are unaffected (tile-independent).

3. Enable kMergeNumHeadGroupsSeqLenQ for hdim=64 decode (previously
   hdim=128 only). For GQA-8 with max_seqlen_q=1, this packs 8 head
   groups into the M dimension. Only activates for no-mask instances
   (kernel static_assert requires !kHasMask).

4. Add qr (non-async) pipeline for fwd non-bias group mode as
   fallback after qr_async. The async pipeline on this branch has a
   kernel-level bug where fmha_fwd launches but writes no output.

Made-with: Cursor
2026-04-01 16:24:25 +00:00
root
6729989b97 Fix FMHA split-KV for paged-KV with page_block_size < kN0
Cherry-picked from aghamari/unified-attention-decode-opt (fadf0d585).
- block_masking.hpp: 5-param GetTileRangeAlongX for GenericAttentionMask
- fmha_fwd_splitkv.py: bn0=32 for hdim=64

Made-with: Cursor
2026-04-01 16:24:19 +00:00
root
4c5e290378 Add unified attention (42_unified_attention) and topk_softmax_decode
Squashed from aghamari/unified-attention-decode-opt branch.

42_unified_attention: CK tile paged-KV attention kernel optimized for
decode with 4-tier dispatch (tiny/small/medium/large), 16x16 MFMA,
2D decode grid, head-group merging. Supports hdim=64 GQA-8 and
hdim=128 MHA with block_size=32.

topk_softmax_decode: fused topk + softmax kernel for M=1 MoE decode.

Made-with: Cursor
2026-04-01 16:24:04 +00:00
Chinmay Dattanand Kuchinad
2bb69a24ea [rocm-libraries] ROCm/rocm-libraries#5776 (commit ee1bbcb)
[CK] Fix async pivot mismatch in persistent GEMM kernel
 scheduler (#5776)

## Motivation

Fix pivot mismatch in the persistent GEMM kernel's async input scheduler
that causes **GPU hangs** and incorrect results when used with AsyncTP
(Asynchronous Tensor Parallelism) on ROCm.

PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM
kernel with chunk signals to overlap communication and computation. The
pivot mechanism ensures each rank starts computing from its own local
shard first (which is already available), then moves to remote chunks as
they arrive over the network.

Because of the pivot mismatch, the kernel frequently waits on signals
for chunks that have not yet arrived, while attempting to read data from
completely different chunks. This synchronization desync reliably
triggers infinite hangs during multi-GPU native AsyncTP execution. This
fix is required to enable functional AsyncTP support on ROCm.

## Technical Details

In the persistent kernel loop (`UniversalGemmKernel::operator()`), the
M-tile coordinate used for data selection (`i_m`) and the M-tile
coordinate used for the chunk-signal wait (`chunk_idx`) were derived
from inconsistent bases:

* `i_m` was computed from the **unpivoted** tile index `iM`.
* `chunk_idx` was computed from the **pivoted** expression `(iM +
tile_idx_pivot)`.

This means the kernel could wait for chunk N's signal but then read from
chunk M's memory, or vice versa. The mismatch scales with GPU count:
with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc.

**The Fix:**
Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m`
and `chunk_idx` from it. This guarantees the kernel always waits for the
correct chunk before reading its data.

*(Note: Minor cosmetic `clang-format` changes were also pulled in
alongside the fix).*

## Test Plan

1. Build PyTorch with this CK change.
2. Run the specific multi-GPU AsyncTP native test:
`timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest
test/distributed/test_symmetric_memory.py -k
test_fused_all_gather_matmul_native -q -s -x`

## Test Result

Tests verify correct overlapping execution without hangs or accuracy
mismatches when running the AsyncTP native path with non-zero pivots.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-01 16:22:08 +00:00
Jobbins
9426f49b52 [rocm-libraries] ROCm/rocm-libraries#6064 (commit cce30ab)
[CK] poll develop every 15 minutes for changes
2026-04-01 14:35:42 +00:00
Fu-Cheng Tsai
a502e5a00b [rocm-libraries] ROCm/rocm-libraries#5798 (commit 7acd4e7)
[CK_TILE] Update gfx12 FMHA forward kernel configs
2026-04-01 14:23:38 +00:00
aledudek
119712bd90 [rocm-libraries] ROCm/rocm-libraries#4469 (commit 0844cb0)
[CK_TILE] Add pooling in tile_engine

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
Add pooling in ck tile engine

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-01 07:32:36 +00:00
Yi DING
791afc6465 [rocm-libraries] ROCm/rocm-libraries#5991 (commit 8d85e8e)
[CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR
 misallocation (#5991)

## Motivation

After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard
around the
`num_total_loop <= 0` early-exit check, the IGLP pipeline
(`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV
gradients for
non-masking kernels (even with fix in #5915). Assembly inspection
confirms that the CFG change causes the LLVM
register allocator to reuse AGPR accumulators as scratch destinations in
the dK/dV
reduction loop, breaking the loop-carried accumulation across Q-tile
iterations.

## Technical Details

- Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in
`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing:
it
restores the CFG shape that the register allocator needs to correctly
assign
  dedicated AGPRs to each column of the dK/dV accumulator.
- Only the IGLP pipeline is affected; the other two BWD pipelines do not
exhibit
  this issue.

## Test Plan

## Test Result

## Submission Checklist

- [x] Look over the contributing guidelines at

https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-01 05:45:19 +00:00
Estevan Vedovelli
a33b5be1b9 [rocm-libraries] ROCm/rocm-libraries#6022 (commit 54b284a)
[CK] contraction: extend GetTypeString() to include
 layout-differentiating params (#6022)

## Motivation

Consumers that identify kernels by their `GetTypeString()` (such as
hipTensor's actor-critic kernel selection, which hashes the string into
a
stable cross-platform UID) were silently dropping one of two colliding
variants during registry insertion.

`GetTypeString()` in `DeviceContractionMultipleD_Xdl_CShuffle`
previously
printed 13 template parameters, omitting
`ABlockTransferSrcScalarPerVector`,
`BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and
`BBlockLdsExtraN`.

These four parameters determine the block-transfer access width and LDS
padding strategy, and are precisely what differentiates the `kk`, `kn`,
`mk`, and `mn` layout variants from one another when all other geometry
parameters are equal. Two instantiations with identical 13-parameter
strings
are distinct C++ types that accept different stride layouts and reject
each
other's arguments via `IsSupportedArgument`.

This patch extends the output to 17 parameters so that every distinct
template instantiation of this class produces a unique
`GetTypeString()`.

## Technical Details

`include/ck/tensor_operation/gpu/device/impl/device_contraction_multiple_d_xdl_cshuffle.hpp`:
- extend `GetTypeString()` from 13 to 17 parameters including
`ABlockTransferSrcScalarPerVector`,
`BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and
`BBlockLdsExtraN`.

## Test Plan

Build CK and hipTensor with these changes, and verify hipTensor can
differentiate and select the
correct kernels with layout variations.

## Test Result

CK is building correctly and hipTensor is selecting the kernels
correctly.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-31 15:19:43 +00:00
Bartłomiej Kocot
ef4ff4667d [rocm-libraries] ROCm/rocm-libraries#5842 (commit 04c5690)
[CK][CK Tile] Force padding for atomic_add bf16 C tensor
 (#5842)

## Motivation

Force padding for atomic_add bf16 C tensor to avoid memfaults.

## Technical Details

- add global atomic add for bf16 and enable them
- add padding for atomic add bf16 due to the lack of oob
- remove padding for not continous dims in conv for other cases
- minor bwd data conv fixes

## Test Plan

test_grouped_conv_*_tile

## Test Result

pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-31 08:03:41 +00:00
jakpiase
66dc81d530 [rocm-libraries] ROCm/rocm-libraries#5729 (commit 516c974)
[CK_TILE] Changed cshuffle LDS descriptor to naive layout
 (#5729)

## Motivation
This PR changes gemm/convolution cshuffle layout into plain one. to
improve cshuffle operation performance.

## Technical Details
The purpose is that before this change the cshuffle layout was having
some descriptor transformations that were probably aimed at reducing LDS
bank conflicts, but the transformations itself were terribly slow, which
negatively impacted the performance.

## Test Plan
There is no need for additional tests, since current tests cover this
functionality.
2026-03-31 03:40:25 +00:00
Illia Silin
e6b8094f94 [rocm-libraries] ROCm/rocm-libraries#5921 (commit 032ac1b)
[CK] fix clang lifetimebound errors with staging compiler
 (#5921)

## Motivation

The ROCm staging compiler (newer Clang) enforces
`[[clang::lifetimebound]]` annotations on methods that return references
or pointers to internal object data. Without these annotations, the
staging compiler emits compilation errors for container accessor methods
across the CK and CK Tile namespaces.

  ## Technical Details

Adds `[[clang::lifetimebound]]` to all reference/pointer-returning
accessors in core container types:

  **`ck::` namespace:**
  - `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()`
  - `index_array` -- `operator[]`
  - `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()`
  - `IndexLookupTable` -- `operator[]`

  **`ck_tile::` namespace:**
  - `array` -- `get(i)`, `at()`, `operator[]`, `operator()`
  - `static_array` -- `operator[]`
  - `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()`
  - `make_kernel()` -- parameter pack

Also removes the unused `instance_index` variable from
`batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing
  accordingly.

  ## Test Plan

- Compile with the staging compiler to verify all lifetimebound errors
are resolved
- Existing tests pass unchanged -- the attribute is a compile-time
annotation with no runtime effect

 ## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 14:20:20 +00:00
Hosang Yoon
2dcae9d173 [rocm-libraries] ROCm/rocm-libraries#5977 (commit 794bea7)
[CK_TILE] Fix Windows build in FMHA head grouping

## Motivation

This is a follow-up fix for [PR
#5018](https://github.com/ROCm/rocm-libraries/pull/5018).

[PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018) added
LLC-aware FMHA head grouping / head-major scheduling on RDNA, but it
also introduced Linux-only code paths, including `<dirent.h>`, which
break Windows builds. This change fixes that by guarding the
Linux-specific LLC probing logic so non-Linux platforms can still build
correctly.

## Technical Details

- Guard `<dirent.h>` with `#ifdef __linux__`
- Guard KFD sysfs traversal logic with `#if defined(__linux__)`
- On non-Linux platforms, return `0` from
`get_kfd_sysfs_llc_cache_bytes()`
- Preserve existing fallback behavior through:
  - `CK_TILE_FMHA_LLC_CACHE_MB`
  - arch-based default LLC sizes
  - no head grouping when no LLC size can be resolved

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 14:19:19 +00:00
Jeff Huang
7968368d92 [rocm-libraries] ROCm/rocm-libraries#5918 (commit a7e2c67)
[CK][CK_TILE] Add fp8bf16 hdim=256 tile for batch prefill
 (#5918)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation
FP8 batch prefill kernels currently only support head_dim=128. Models
with head_dim=256 hit the "invalid argument for batch_prefill" error
because no matching kernel variant exists in the codegen dispatch.

## Technical Details
Add a hdim=256 tile size entry for fp8bf16 in the batch prefill codegen
recipe (`fmha_batch_prefill.py`).

Tile configuration: `FmhaFwdTileSize(128, 128, 32, 256, 32, 256, 4,1,1,
4,1,1, 32,32,32, 32,32,32, -1)`
- bm0=128, bn0=128 (Q/K tile sizes)
- bn1=256, bk0max=256 (V head_dim=256)
- Warp MFMA 32x32x32 (fp8 MFMA instructions)

This mirrors the existing bf16/fp16 hdim=256 tile but uses fp8 warp
sizes.

## Test Plan
Tested on both MI308X (gfx942) and MI355X (gfx950) via aiter batch
prefill test with the following matrix:
- page_size: {1, 16, 1024}
- kv_layout: {linear, vectorized}
- lookup_table: {sglang, vllm}
- causal: {true, false}
- logits_soft_cap: {0.0, 30.0}
- contiguous_kv: {true, false}

## Test Result

**MI308X (gfx942):** 160 passed, 32 skipped (page_size=1 + vectorized
not applicable)
**MI355X (gfx950):** 120 passed, 72 skipped (pre-existing ROCm 7.2
compiler issue with causal + no softcap)

No register spills on either platform.

### Profiling — MI355X (gfx950), FP8 pertensor, hdim=256, seqlen=1024, 8
heads

| page_sz | kv_layout | table | causal | soft_cap | time_us | TFLOPS |
|---------|-----------|-------|--------|----------|---------|--------|
| 1 | linear | sglang | False | 0.00 | 55.01 | 156.16 |
| 1 | linear | vllm | False | 0.00 | 55.12 | 155.84 |
| 1 | linear | sglang | False | 30.00 | 62.63 | 137.16 |
| 1 | linear | vllm | False | 30.00 | 62.16 | 138.20 |
| 1 | linear | sglang | True | 30.00 | 64.09 | 67.01 |
| 1 | linear | vllm | True | 30.00 | 63.85 | 67.27 |
| 16 | linear | sglang | False | 0.00 | 57.00 | 150.69 |
| 16 | vectorized | sglang | False | 0.00 | 57.55 | 149.25 |
| 16 | linear | vllm | False | 0.00 | 56.80 | 151.23 |
| 16 | vectorized | vllm | False | 0.00 | 57.32 | 149.87 |
| 16 | linear | sglang | False | 30.00 | 64.77 | 132.62 |
| 16 | vectorized | vllm | False | 30.00 | 63.54 | 135.18 |
| 16 | linear | sglang | True | 30.00 | 66.84 | 64.26 |
| 16 | vectorized | vllm | True | 30.00 | 66.12 | 64.96 |
| 1024 | linear | sglang | False | 0.00 | 58.25 | 147.46 |
| 1024 | vectorized | sglang | False | 0.00 | 57.53 | 149.31 |
| 1024 | linear | vllm | False | 0.00 | 58.06 | 147.94 |
| 1024 | vectorized | vllm | False | 0.00 | 57.55 | 149.27 |
| 1024 | linear | sglang | False | 30.00 | 65.38 | 131.38 |
| 1024 | vectorized | vllm | False | 30.00 | 63.64 | 134.98 |
| 1024 | linear | sglang | True | 30.00 | 66.85 | 64.25 |
| 1024 | vectorized | vllm | True | 30.00 | 65.26 | 65.81 |

### Profiling — MI308X (gfx942), FP8 pertensor, hdim=256, seqlen=1024, 8
heads

| page_sz | kv_layout | table | causal | soft_cap | time_us | TFLOPS |
|---------|-----------|-------|--------|----------|---------|--------|
| 1 | linear | sglang | False | 0.00 | 110.18 | 77.96 |
| 1 | linear | vllm | True | 30.00 | 134.33 | 31.97 |
| 1 | linear | sglang | True | 30.00 | 134.59 | 31.91 |
| 16 | linear | sglang | False | 0.00 | 115.43 | 74.42 |
| 16 | vectorized | sglang | False | 0.00 | 106.11 | 80.95 |
| 16 | linear | vllm | False | 0.00 | 116.34 | 73.83 |
| 16 | vectorized | vllm | False | 0.00 | 106.17 | 80.91 |
| 16 | linear | sglang | False | 30.00 | 135.61 | 63.34 |
| 16 | vectorized | vllm | False | 30.00 | 122.37 | 70.20 |
| 16 | linear | sglang | True | 0.00 | 117.44 | 36.57 |
| 16 | vectorized | vllm | True | 0.00 | 108.81 | 39.47 |
| 16 | linear | sglang | True | 30.00 | 139.43 | 30.80 |
| 16 | vectorized | vllm | True | 30.00 | 125.87 | 34.12 |
| 1024 | linear | sglang | False | 0.00 | 110.65 | 77.63 |
| 1024 | vectorized | sglang | False | 0.00 | 101.70 | 84.46 |
| 1024 | linear | vllm | False | 0.00 | 111.71 | 76.89 |
| 1024 | vectorized | vllm | False | 0.00 | 101.55 | 84.59 |
| 1024 | linear | sglang | False | 30.00 | 129.33 | 66.42 |
| 1024 | vectorized | vllm | False | 30.00 | 120.95 | 71.02 |
| 1024 | linear | sglang | True | 0.00 | 112.26 | 38.26 |
| 1024 | vectorized | vllm | True | 0.00 | 103.02 | 41.69 |
| 1024 | linear | sglang | True | 30.00 | 133.73 | 32.12 |
| 1024 | vectorized | vllm | True | 30.00 | 124.75 | 34.43 |

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 10:21:33 +00:00
Yi DING
fb64a4453c [rocm-libraries] ROCm/rocm-libraries#5915 (commit a72cf7d)
[CK_TILE] Fix FMHA BWD register pressure by wrapping
 num_total_loop with amd_wave_read_first_lane (#5915)

## Motivation

In three FMHA backward pipelines, `num_total_loop` is computed without
`amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even
though it is logically uniform across all lanes. This raises register
pressure, and under high pressure the compiler may reuse VGPRs across
overlapping live ranges. This was confirmed via assembly inspection: the
compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an
intermediate value for dV, producing incorrect dK/dV gradients.

## Technical Details

Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three
pipelines:
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr`
- `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp`
- `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr`

This promotes `num_total_loop` to an SGPR, eliminating the excess
register pressure and the incorrect VGPR reuse.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-30 01:45:16 +00:00
Jan Patrick Lehr
b6bbada9f1 [rocm-libraries] ROCm/rocm-libraries#5639 (commit a65e645)
[CK] More lifetime-warning suppression

## Motivation

The staging compiler picked up another change from upstream that leads
to more lifetime-analysis warnings. This breaks the build, given CK is
built with -Werror. As a result, compiler promotion is blocked.

## Technical Details
This patch adds the pragma push diagnostics to ignore the
lifetime-warnings in the modified files to unblock compiler promotion.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-28 11:20:51 +00:00
Linjun-AMD
3b55a05e71 [rocm-libraries] ROCm/rocm-libraries#5849 (commit d9b89b2)
[CK_TILE ]Revert "[CK_TILE] Enable MXFP6 for MX GEMM op
 (#5095)" (#5849)

This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637.

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 20:37:23 +00:00
Bartłomiej Kocot
c28d0033d7 [rocm-libraries] ROCm/rocm-libraries#5785 (commit d8ecfc1)
[CK] Fix min k_batch calculation in conv kernels

## Motivation

Avoid division by 0 and remove not needed "-1".

## Technical Details

Our div up implementation return lower value if input is divisible.
There is no need to subtract 1.

## Test Plan

test_grouped_conv_bwd_weight

## Test Result

Passed locally.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

AICK-1019
2026-03-27 15:38:21 +00:00
Illia Silin
4c926497ad [rocm-libraries] ROCm/rocm-libraries#5829 (commit 19b2813)
[CK] Fix error in dockerfile when building staging compiler.
 (#5829)

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 15:37:21 +00:00
Johannes Graner
58475d3f45 [rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649)
[CK Tile] StreamK support for Bwd Weight grouped convolutions
 (#5393)

## Motivation

Add StreamK work distribution to the CK Tile grouped convolution
backward weight kernel. Split-K divides the K-dimension uniformly across
a fixed `k_batch`, which causes load imbalance when the number of output
tiles doesn't evenly fill the GPU. StreamK distributes total
K-iterations evenly across workgroups, improving utilization on these
shapes.

## Technical Details

StreamK is added as an `if constexpr` branch in the existing kernel,
selected by the `TilePartitioner_` template parameter. Two reduction
strategies are supported:
- **Linear**: tile-starter sequentially accumulates partials from
contributing CTAs
- **Tree**: pairwise binary tree reduction (O(log n) depth, faster for
many contributors)

Both persistent and non-persistent data-parallel (DP) sections are
supported.

Key changes:
- `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution
path with `RunStreamK`/`RunStreamKLoop`, partial store/load via
workspace, flag-based cross-CTA synchronization,
`GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions
- `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers)
and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by
both GEMM and Conv StreamK kernels
- `streamk_gemm_kernel.hpp`: Refactored to use shared helpers
- Merged split-K and StreamK example invokers via `PartitionerPolicy`
template parameter
- StreamK example binary with `--streamk_reduction=linear|tree` and
`--streamk_persistent=0|1`
- CK Builder integration: `SpecifiesStreamK` concept,
`TilePartitionerType` factory helper, `InstanceTraits` with StreamK
fields
- 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP),
negative, builder regression

### Performance (MI355X, gfx950)

Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}):

| Shape | 16x64 tiles | | 128x128 tiles | |
|---|---|---|---|---|
| | Split-K | StreamK | Split-K | StreamK |
| 1x1 128x128 N=32 28x28 | 1.00x | 0.54x | 1.00x | 0.81x |
| 3x3 128x128 N=32 14x14 | 1.00x | 0.59x | 1.00x | 0.62x |
| 1x1 256x64 N=32 56x56 | 1.00x | 0.83x | 1.00x | 1.83x |
| 3x3 512x512 N=2 7x7 | 1.00x | 1.12x | 1.00x | 0.62x |
| 1x1 1024x1024 N=4 7x7 | 1.00x | 1.09x | 1.00x | 0.60x |
| 3x3 128x128 N=32 28x28 | 1.00x | 0.44x | 1.00x | 0.96x |
| 3x3 256x256 N=32 14x14 | 1.00x | 0.67x | 1.00x | 0.93x |
| 3x3 512x512 N=32 7x7 | 1.00x | 0.98x | 1.00x | 1.16x |

StreamK's value depends on tile config: with larger tiles (fewer output
tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up
to 1.16x on typical large-channel convolutions. Tree reduction
consistently outperforms Linear when multiple CTAs contribute to the
same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n)
sequential accumulation. The table reports the best of Linear and Tree
for each shape.

## Test Plan

```bash
ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk
./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk

# Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON)
ninja -C build check-builder
```

30 tests covering:
- Host-side: type traits, kernel args construction, grid size, workspace
size
- GPU end-to-end (Linear + Tree): small/medium shapes, multi-group,
stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher
occupancy
- Persistent DP: Linear + Tree with persistent data-parallel dispatch
- Negative: `IsSupportedArgument` rejects unaligned K and C
- Builder: Create (instance string validation) + Execution (reference
comparison) + instance string regression

## Test Result

All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK
tests pass. Full `check-builder` suite passes. Tolerances computed
dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware).

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 09:18:14 +00:00
arai713
36f2ec23f5 [rocm-libraries] ROCm/rocm-libraries#5445 (commit 2cdbf8b)
[CK_TILE] Support for CompV4 pipeline in Stream-K GEMM
 (#5445)

## Motivation
This PR is extending the pipeline support for Stream-K GEMM by adding
the CompV4 pipeline. Additional pipelines will be added in subsequent
PRs.

## Technical Details

- Enable the CompV4 pipeline by adding an option to set DoubleSMemBuffer
to true if the CompV4 pipeline has been selected as it requires double
buffered shared memory
- Addition of CompV4 pipeline into the extended tests: kernel instances
mirror the existing CompV3/Mem configurations (same layout permutations,
data types, and tile sizes) with the pipeline type set to CompV4.
- Addition of CompV4 pipeline into smoke tests (generated using Tile
Engine)

## Test Plan
These were tested using the existing smoke and extended tests.

## Test Result
All tests passed
## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 08:13:27 +00:00
Yi DING
47a04fda08 [rocm-libraries] ROCm/rocm-libraries#5790 (commit c132b5a)
[CK_TILE] Fix NaN for FMHA BWD When seq_q=0

## Motivation
This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the
effective query sequence length for a tile is zero, by ensuring the
per-tile pipelines exit early with zeroed accumulators and by avoiding
an early kernel return that prevented writing out cleared gradients.

## Technical Details
- Add unconditional early-exit in the dK/dV pipelines when
`num_total_loop <= 0` (no work), returning zeroed accumulators.
- Adjust group-mode kernel early-return logic to only return when
**both** `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and
store cleared dK/dV when `seqlen_q == 0`.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 07:54:53 +00:00
Yaswanth Raparti
e2470e837a [rocm-libraries] ROCm/rocm-libraries#5880 (commit a6b6c05)
[CK][CK_TILE] Fix CTest parsing to handle all test number
 formats (#5880)

## Motivation
Fix a bug in the smart-build --ctest-only filter that was incorrectly
excluding tests with numbers less than 100.

## Technical Details
The issue was caused by CTest formatting test numbers with variable
spacing based on the number of digits:
  - "Test   `#1`: name (3 spaces for tests 1-9)"
  - "Test  `#79`: name (2 spaces for tests 10-99)"
  - "Test `#100`: name (1 space for tests 100+)"

The previous code used `line.strip().startswith("Test #")` which only
matched tests with a single space (i.e., test numbers >= 100).

This caused tests like ck_tile_unit_sequence (Test #79) to be excluded
from smart-build test selection, resulting in CTest failures when the
binary wasn't built.

Solution: Replace string matching with a regex pattern that handles
all spacing variations: r'^\s*Test\s+#\d+:\s*(.+)$'

## Test Plan
Tested with test numbers from 1 to 12345.

## Test Result
  - Before: 48 tests selected (only tests #100+)
  - After: 146 tests selected (all CTest-registered tests)

## Submission Checklist

- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-03-27 06:34:12 +00:00
Illia Silin
2f98c7bbef [rocm-libraries] ROCm/rocm-libraries#5891 (commit 82563ff)
fix AITER docker setup

## Motivation

Add a new python package required to build AITER.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 04:36:16 +00:00
Bartłomiej Kocot
1c95ce0668 [rocm-libraries] ROCm/rocm-libraries#5856 (commit 2d9a0a1)
[CK] Fix unused param mask

## Motivation

Compiler error caused by unused param mask.

## Technical Details

Skip tests using param mask in test loop.

## Test Plan

Current test improvements.

## Test Result

Passed locally

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-27 03:58:37 +00:00
dependabot[bot]
6215bb8dbc [rocm-libraries] ROCm/rocm-libraries#5896 (commit b7436b5)
Bump requests from 2.32.5 to 2.33.0 in
 /projects/composablekernel/docs/sphinx (#5896)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Bumps [requests](https://github.com/psf/requests) from 2.32.5 to 2.33.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/psf/requests/releases">requests's
releases</a>.</em></p>
<blockquote>
<h2>v2.33.0</h2>
<h2>2.33.0 (2026-03-25)</h2>
<p><strong>Announcements</strong></p>
<ul>
<li>📣 Requests is adding inline types. If you have a typed code base
that uses Requests, please take a look at <a
href="https://redirect.github.com/psf/requests/issues/7271">#7271</a>.
Give it a try, and report any gaps or feedback you may have in the
issue. 📣</li>
</ul>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2026-25645 <code>requests.utils.extract_zipped_paths</code> now
extracts contents to a non-deterministic location to prevent malicious
file replacement. This does not affect default usage of Requests, only
applications calling the utility function directly.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Migrated to a PEP 517 build system using setuptools. (<a
href="https://redirect.github.com/psf/requests/issues/7012">#7012</a>)</li>
</ul>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed an issue where an empty netrc entry could cause malformed
authentication to be applied to Requests on Python 3.11+. (<a
href="https://redirect.github.com/psf/requests/issues/7205">#7205</a>)</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Dropped support for Python 3.9 following its end of support. (<a
href="https://redirect.github.com/psf/requests/issues/7196">#7196</a>)</li>
</ul>
<p><strong>Documentation</strong></p>
<ul>
<li>Various typo fixes and doc improvements.</li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/M0d3v1"><code>@​M0d3v1</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/6865">psf/requests#6865</a></li>
<li><a href="https://github.com/aminvakil"><code>@​aminvakil</code></a>
made their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/7220">psf/requests#7220</a></li>
<li><a href="https://github.com/E8Price"><code>@​E8Price</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/6960">psf/requests#6960</a></li>
<li><a href="https://github.com/mitre88"><code>@​mitre88</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/7244">psf/requests#7244</a></li>
<li><a href="https://github.com/magsen"><code>@​magsen</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/6553">psf/requests#6553</a></li>
<li><a
href="https://github.com/Rohan5commit"><code>@​Rohan5commit</code></a>
made their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/7227">psf/requests#7227</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/psf/requests/blob/main/HISTORY.md#2330-2026-03-25">https://github.com/psf/requests/blob/main/HISTORY.md#2330-2026-03-25</a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/psf/requests/blob/main/HISTORY.md">requests's
changelog</a>.</em></p>
<blockquote>
<h2>2.33.0 (2026-03-25)</h2>
<p><strong>Announcements</strong></p>
<ul>
<li>📣 Requests is adding inline types. If you have a typed code base
that
uses Requests, please take a look at <a
href="https://redirect.github.com/psf/requests/issues/7271">#7271</a>.
Give it a try, and report
any gaps or feedback you may have in the issue. 📣</li>
</ul>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2026-25645 <code>requests.utils.extract_zipped_paths</code> now
extracts
contents to a non-deterministic location to prevent malicious file
replacement. This does not affect default usage of Requests, only
applications calling the utility function directly.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Migrated to a PEP 517 build system using setuptools. (<a
href="https://redirect.github.com/psf/requests/issues/7012">#7012</a>)</li>
</ul>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed an issue where an empty netrc entry could cause
malformed authentication to be applied to Requests on
Python 3.11+. (<a
href="https://redirect.github.com/psf/requests/issues/7205">#7205</a>)</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Dropped support for Python 3.9 following its end of support. (<a
href="https://redirect.github.com/psf/requests/issues/7196">#7196</a>)</li>
</ul>
<p><strong>Documentation</strong></p>
<ul>
<li>Various typo fixes and doc improvements.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="bc04dfd6da"><code>bc04dfd</code></a>
v2.33.0</li>
<li><a
href="66d21cb07b"><code>66d21cb</code></a>
Merge commit from fork</li>
<li><a
href="8b9bc8fc0f"><code>8b9bc8f</code></a>
Move badges to top of README (<a
href="https://redirect.github.com/psf/requests/issues/7293">#7293</a>)</li>
<li><a
href="e331a288f3"><code>e331a28</code></a>
Remove unused extraction call (<a
href="https://redirect.github.com/psf/requests/issues/7292">#7292</a>)</li>
<li><a
href="753fd08c5e"><code>753fd08</code></a>
docs: fix FAQ grammar in httplib2 example</li>
<li><a
href="774a0b837a"><code>774a0b8</code></a>
docs(socks): same block as other sections</li>
<li><a
href="9c72a41bec"><code>9c72a41</code></a>
Bump github/codeql-action from 4.33.0 to 4.34.1</li>
<li><a
href="ebf7190679"><code>ebf7190</code></a>
Bump github/codeql-action from 4.32.0 to 4.33.0</li>
<li><a
href="0e4ae38f0c"><code>0e4ae38</code></a>
docs: exclude Response.is_permanent_redirect from API docs (<a
href="https://redirect.github.com/psf/requests/issues/7244">#7244</a>)</li>
<li><a
href="d568f47278"><code>d568f47</code></a>
docs: clarify Quickstart POST example (<a
href="https://redirect.github.com/psf/requests/issues/6960">#6960</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/psf/requests/compare/v2.32.5...v2.33.0">compare
view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=requests&package-manager=pip&previous-version=2.32.5&new-version=2.33.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
2026-03-26 22:01:37 +00:00
joyeamd
046d3ac274 [rocm-libraries] ROCm/rocm-libraries#5789 (commit 6654ca6)
[CK][CK_TILE] Revert addional oob check in gemm IsSupported
 function (#5789)

## Motivation

fix ck_tile's oob check.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-26 01:41:35 +00:00
Estevan Vedovelli
0004a37de5 [rocm-libraries] ROCm/rocm-libraries#5675 (commit fbd7fa7)
[CK] Properly build HIPTENSOR_REQ_LIBS_ONLY targets when used
 in addition to MIOPEN_REQ_LIBS_ONLY (#5675)

## Motivation

When building CK with both -DHIPTENSOR_REQ_LIBS_ONLY=ON and
-DMIOPEN_REQ_LIBS_ONLY=ON, only MIOpen targets were being properly
installed. This change is necessary to allow hipTensor to build with
TheRock without the need to rebuild CK from source.

## Technical Details

The solutions consists in considering both HIPTENSOR_REQ_LIBS_ONLY and
MIOPEN_REQ_LIBS_ONLY when including hiptensor's targets in CMake,
following the same approach used to the conv target (for MIOpen).

## Test Plan

Manually test the build and installation with
`-DHIPTENSOR_REQ_LIBS_ONLY=ON` and both `-DHIPTENSOR_REQ_LIBS_ONLY=ON
-DMIOPEN_REQ_LIBS_ONLY=ON`, and verify that the proper files as
installed.

## Test Result

The build with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` properly includes the
targets contraction, reduction and other, while
`-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON` includes conv,
contraction, reduction and other.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-25 23:59:53 +00:00
Illia Silin
86ec92f925 [rocm-libraries] ROCm/rocm-libraries#5571 (commit 8f60932)
[CK] fix clang lifetime bound error in ck_builder.

## Motivation

This resolves the compilation error with latest develop compiler branch.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-25 16:45:38 +00:00
Illia Silin
bee61860c2 [rocm-libraries] ROCm/rocm-libraries#5764 (commit f3c1232)
Re-enable daily builds with staging compiler

## Motivation

This should help us catch and fix any new compilation issues early on.

## Technical Details

We now have three compiler profiles:
* **develop**: slightly stabilized version of amd-staging with some of
the obvious offending PRs reverted, 1-2 weeks behind amd-staging;
* **amd-mainline**: more stable version of compiler, the baseline for
all other branches, e.g., release, npi, etc. 2-4 weeks behind
amd-staging.
* **amd-staging**: latest compiler version where all new PRs land, often
broken;

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: kensclin <lshyhchy@amd.com>
2026-03-25 16:37:58 +00:00
Ville Pietilä
ec2dbfbfde [rocm-libraries] ROCm/rocm-libraries#5516 (commit ff3afda)
[CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler
 (#5516)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

We want close the performance gap between old CK and CK Tile for bwd
data convolutions. To achieve this, we need tow things

- Configurations for the old CK kernel instances such that we can map
them into CK Tile instances.
- Support in CK profiler to run the CK Tile instance with the same API
as for old CK instances.

## Technical Details

Extracted kernel configurations from old CK. The codegen python script
for CK Tile convs is extended to support also bwd data. The generated
instances are added to the CMake build (target
`device_grouped_conv_bwd_data_tile_instances`).
A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the
CK Profiler. The API is same as for old CK's profiler op
`grouped_conv_bwd_data`.
2026-03-25 14:36:11 +00:00