composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 02:02:46 +00:00

Author	SHA1	Message	Date
root	393ebc1a50	WIP backup: snapshot all local notes, slides, tutorials, and kernel work Backup commit grouping all in-progress local work so nothing is lost: - Modified CK-UA kernel + example sources (unified_attention.cpp, unified_attention_kernel.hpp) and CMake/build files. - Updated dispatcher README and ctypes_utils.py. - New unified_attention example notes: PARAMETERS.md, VARIABLES.md. - New unified_attention instances for d128 fp16/bf16 (mask/nmask, gqa6). - New 99_toy_tutorial/ collection: bank-conflict investigations (test_.cpp, .js, .gdb, .asm, .md), tile distribution / row reduction / calling_gemm / thread_buffer tutorials. - Slide decks and supporting assets (bank_conflict_slides.qmd/.html, tile_distribution_slides.qmd, assets/, _files/, step1_reshape_only, xor_full_steps_simple). - GDB helper script (break_on_ds_read.gdb). Not intended for upstream review; pure WIP snapshot.	2026-05-11 20:34:52 +00:00
root	3f076a6fc1	Add IsOutOfSinkBound alias in GenericAttentionMask (API compatibility) Made-with: Cursor	2026-04-23 08:17:34 +00:00
root	8506db8761	Fix int32 overflow in CK-UA pipeline via pointer rebasing tensor_coordinate::get_offset() returns index_t (int32), causing overflow when page_idx * block_size * stride > 2^31 (~131K blocks for d64/GQA-8). Fix: rebase K/V data pointer for each page using int64 arithmetic instead of set_window_origin with large offsets. After rebasing p_data_ and buffer_size_, call init_raw() to refresh the AMD buffer resource descriptor, then set_window_origin({0,0}) to reset cached coordinates. Tested: num_blocks up to 2M with nkh=1/8, blk=32/64. All pass. Made-with: Cursor	2026-04-02 09:39:07 +00:00
root	e8587b86c2	Fix CK-UA pipeline: s_waitcnt_vmcnt<0> in fmha_post_process The final V tile's async load was not properly waited on before reading from LDS: s_waitcnt_vmcnt<K_inst> allowed V_inst outstanding loads (a no-op when K_inst == V_inst). The last loop iteration never prefetches K, so only V is outstanding. Use s_waitcnt_vmcnt<0> unconditionally. This partially fixes the BS32 race condition for production workloads (maxk >= 256). A deeper pipeline race remains for very short KV sequences (maxk < ~165, 2-5 pages) with block_size=32 at high batch. Made-with: Cursor	2026-04-01 23:04:07 +00:00
root	87d16738bf	WIP: CK-UA KV-segment parallelism - kernel args and split range Added split-KV fields to UnifiedAttentionVarlenKargs (num_splits, i_split, lse_acc_ptr, o_acc_ptr + strides). Modified operator() to compute per-split KV range using blocks_per_split. INCOMPLETE: The pipeline returns normalized o_acc but the split-KV combine kernel needs unnormalized o_acc + lse. Need to modify the pipeline to optionally return m and l values alongside o_acc. The kernel changes compile but the epilogue needs the split path (write to float accumulators instead of final output). Made-with: Cursor	2026-04-01 19:09:59 +00:00
root	63821af1ff	Add split-KV decode tiles (b16x32, b32x32) + fix num_splits heuristic Decode tiles for split-KV hdim=64: bm0=16/1-warp and bm0=32/2-warp. Fix num_splits to use num_heads_kv (not num_heads_q) and target 4x SMs. Performance unchanged (0.056ms) because: 1. Split+combine overhead dominates for short KV (31 pages) 2. Triton 3D's single-kernel split avoids combine kernel entirely Made-with: Cursor	2026-04-01 18:49:16 +00:00
root	c5600bc8ae	Add decode tiles (b16x32, b32x32) to pagedkv_prefill codegen with max_seqlen_q dispatch Made-with: Cursor	2026-04-01 18:30:06 +00:00
root	65a3f88ad8	Fix CK-UA mixed batch: use max_seqlen_q for tier selection Decode grid (num_kv_heads, num_seqs) assumes each seq has <= kBlockQ tokens. For mixed batches (decode + prefill), avg_q is low but some seqs have hundreds of tokens, causing truncation. Added max_seqlen_q to args and check it in select_tile_tier to force medium tier (1D grid with Q tile iteration) for mixed batches. 362/362 no-window shapes now pass. Made-with: Cursor	2026-04-01 18:09:48 +00:00
root	07ba03bcbf	Fix sliding window mask: use window_generic when left >= 0 mask_info::decode('b:left,right,sink') always created mask_bottom_right (IsLocal=false) which ignores the left window boundary. For sliding window attention (left >= 0), use window_generic (IsLocal=true) so the kernel respects the left boundary. Fixes: CK split-KV producing identical results with and without sliding window. Now 724/724 shapes pass correctness vs Triton. Made-with: Cursor	2026-04-01 18:00:19 +00:00
root	e5272603c9	Wire FmhaFwdPagedKV: enable bf16 hdim=64 with bn0=32 for page_block_size=32 Made-with: Cursor	2026-04-01 17:18:41 +00:00
root	10564b0c40	Enable FmhaFwdPagedKV bf16 hdim=64 instances (was commented out) Made-with: Cursor	2026-04-01 16:49:20 +00:00
root	cd7ba6e2e8	Add unified attention (42_unified_attention) Squashed from aghamari/unified-attention-decode-opt branch. CK tile paged-KV attention kernel optimized for decode with 4-tier dispatch (tiny/small/medium/large), 16x16 MFMA, 2D decode grid, head-group merging. Supports hdim=64 GQA-8 and hdim=128 MHA with block_size=32. Made-with: Cursor	2026-04-01 16:39:15 +00:00
root	ec2db01e4a	Fix fmha_fwd early-exit bug: seqlen_q <= min_seqlen_q should be < The kSkipMinSeqlenQ optimization incorrectly used <= comparison, causing the kernel to skip batches where seqlen_q equals min_seqlen_q. This happens in the common case of no padding (all batches have the same seqlen_q == min_seqlen_q), producing all-zero output silently. Changed to strict < so batches with exactly min_seqlen_q tokens are still processed. Made-with: Cursor	2026-04-01 16:24:31 +00:00
root	cb6fb2802d	Split-KV codegen: dual-tile dispatch and head-merge for hdim=64 1. Dual-tile: add both bn0=64 (preferred) and bn0=32 (fallback) for hdim=64 on gfx9 and gfx12. The dispatch checks page_block_size % bn0 == 0 at runtime to select the optimal tile. bn0=64 halves KV iterations when page_block_size >= 64. 2. Tile dict now supports lists per hdim. The codegen loop iterates over all tile variants, generating separate kernel instances for each. Combine kernels are unaffected (tile-independent). 3. Enable kMergeNumHeadGroupsSeqLenQ for hdim=64 decode (previously hdim=128 only). For GQA-8 with max_seqlen_q=1, this packs 8 head groups into the M dimension. Only activates for no-mask instances (kernel static_assert requires !kHasMask). 4. Add qr (non-async) pipeline for fwd non-bias group mode as fallback after qr_async. The async pipeline on this branch has a kernel-level bug where fmha_fwd launches but writes no output. Made-with: Cursor	2026-04-01 16:24:25 +00:00
root	6729989b97	Fix FMHA split-KV for paged-KV with page_block_size < kN0 Cherry-picked from aghamari/unified-attention-decode-opt (fadf0d585). - block_masking.hpp: 5-param GetTileRangeAlongX for GenericAttentionMask - fmha_fwd_splitkv.py: bn0=32 for hdim=64 Made-with: Cursor	2026-04-01 16:24:19 +00:00
root	4c5e290378	Add unified attention (42_unified_attention) and topk_softmax_decode Squashed from aghamari/unified-attention-decode-opt branch. 42_unified_attention: CK tile paged-KV attention kernel optimized for decode with 4-tier dispatch (tiny/small/medium/large), 16x16 MFMA, 2D decode grid, head-group merging. Supports hdim=64 GQA-8 and hdim=128 MHA with block_size=32. topk_softmax_decode: fused topk + softmax kernel for M=1 MoE decode. Made-with: Cursor	2026-04-01 16:24:04 +00:00
Chinmay Dattanand Kuchinad	2bb69a24ea	[rocm-libraries] ROCm/rocm-libraries#5776 (commit ee1bbcb) [CK] Fix async pivot mismatch in persistent GEMM kernel scheduler (#5776) ## Motivation Fix pivot mismatch in the persistent GEMM kernel's async input scheduler that causes GPU hangs and incorrect results when used with AsyncTP (Asynchronous Tensor Parallelism) on ROCm. PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM kernel with chunk signals to overlap communication and computation. The pivot mechanism ensures each rank starts computing from its own local shard first (which is already available), then moves to remote chunks as they arrive over the network. Because of the pivot mismatch, the kernel frequently waits on signals for chunks that have not yet arrived, while attempting to read data from completely different chunks. This synchronization desync reliably triggers infinite hangs during multi-GPU native AsyncTP execution. This fix is required to enable functional AsyncTP support on ROCm. ## Technical Details In the persistent kernel loop (`UniversalGemmKernel::operator()`), the M-tile coordinate used for data selection (`i_m`) and the M-tile coordinate used for the chunk-signal wait (`chunk_idx`) were derived from inconsistent bases: * `i_m` was computed from the unpivoted tile index `iM`. * `chunk_idx` was computed from the pivoted expression `(iM + tile_idx_pivot)`. This means the kernel could wait for chunk N's signal but then read from chunk M's memory, or vice versa. The mismatch scales with GPU count: with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc. The Fix: Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m` and `chunk_idx` from it. This guarantees the kernel always waits for the correct chunk before reading its data. (Note: Minor cosmetic `clang-format` changes were also pulled in alongside the fix). ## Test Plan 1. Build PyTorch with this CK change. 2. Run the specific multi-GPU AsyncTP native test: `timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest test/distributed/test_symmetric_memory.py -k test_fused_all_gather_matmul_native -q -s -x` ## Test Result Tests verify correct overlapping execution without hangs or accuracy mismatches when running the AsyncTP native path with non-zero pivots. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 16:22:08 +00:00
Jobbins	9426f49b52	[rocm-libraries] ROCm/rocm-libraries#6064 (commit cce30ab) [CK] poll develop every 15 minutes for changes	2026-04-01 14:35:42 +00:00
Fu-Cheng Tsai	a502e5a00b	[rocm-libraries] ROCm/rocm-libraries#5798 (commit 7acd4e7) [CK_TILE] Update gfx12 FMHA forward kernel configs	2026-04-01 14:23:38 +00:00
aledudek	119712bd90	[rocm-libraries] ROCm/rocm-libraries#4469 (commit 0844cb0) [CK_TILE] Add pooling in tile_engine ## Motivation Add pooling in ck tile engine ## Technical Details ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 07:32:36 +00:00
Yi DING	791afc6465	[rocm-libraries] ROCm/rocm-libraries#5991 (commit 8d85e8e) [CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR misallocation (#5991) ## Motivation After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard around the `num_total_loop <= 0` early-exit check, the IGLP pipeline (`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV gradients for non-masking kernels (even with fix in #5915). Assembly inspection confirms that the CFG change causes the LLVM register allocator to reuse AGPR accumulators as scratch destinations in the dK/dV reduction loop, breaking the loop-carried accumulation across Q-tile iterations. ## Technical Details - Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in `BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing: it restores the CFG shape that the register allocator needs to correctly assign dedicated AGPRs to each column of the dK/dV accumulator. - Only the IGLP pipeline is affected; the other two BWD pipelines do not exhibit this issue. ## Test Plan ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 05:45:19 +00:00
Estevan Vedovelli	a33b5be1b9	[rocm-libraries] ROCm/rocm-libraries#6022 (commit 54b284a) [CK] contraction: extend GetTypeString() to include layout-differentiating params (#6022) ## Motivation Consumers that identify kernels by their `GetTypeString()` (such as hipTensor's actor-critic kernel selection, which hashes the string into a stable cross-platform UID) were silently dropping one of two colliding variants during registry insertion. `GetTypeString()` in `DeviceContractionMultipleD_Xdl_CShuffle` previously printed 13 template parameters, omitting `ABlockTransferSrcScalarPerVector`, `BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and `BBlockLdsExtraN`. These four parameters determine the block-transfer access width and LDS padding strategy, and are precisely what differentiates the `kk`, `kn`, `mk`, and `mn` layout variants from one another when all other geometry parameters are equal. Two instantiations with identical 13-parameter strings are distinct C++ types that accept different stride layouts and reject each other's arguments via `IsSupportedArgument`. This patch extends the output to 17 parameters so that every distinct template instantiation of this class produces a unique `GetTypeString()`. ## Technical Details `include/ck/tensor_operation/gpu/device/impl/device_contraction_multiple_d_xdl_cshuffle.hpp`: - extend `GetTypeString()` from 13 to 17 parameters including `ABlockTransferSrcScalarPerVector`, `BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and `BBlockLdsExtraN`. ## Test Plan Build CK and hipTensor with these changes, and verify hipTensor can differentiate and select the correct kernels with layout variations. ## Test Result CK is building correctly and hipTensor is selecting the kernels correctly. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 15:19:43 +00:00
Bartłomiej Kocot	ef4ff4667d	[rocm-libraries] ROCm/rocm-libraries#5842 (commit 04c5690) [CK][CK Tile] Force padding for atomic_add bf16 C tensor (#5842) ## Motivation Force padding for atomic_add bf16 C tensor to avoid memfaults. ## Technical Details - add global atomic add for bf16 and enable it - add padding for bf16 atomic add due to the lack of OOB (out-of-bounds) handling - remove padding for non-contiguous dims in conv for other cases - minor bwd data conv fixes ## Test Plan test_grouped_conv_*_tile ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 08:03:41 +00:00
jakpiase	66dc81d530	[rocm-libraries] ROCm/rocm-libraries#5729 (commit 516c974) [CK_TILE] Changed cshuffle LDS descriptor to naive layout (#5729) ## Motivation This PR changes the GEMM/convolution cshuffle LDS layout to a plain (naive) one to improve cshuffle performance. ## Technical Details Before this change, the cshuffle layout used descriptor transformations that were probably aimed at reducing LDS bank conflicts, but the transformations themselves were slow enough to hurt overall performance. ## Test Plan No additional tests are needed, since the existing tests cover this functionality.	2026-03-31 03:40:25 +00:00
Illia Silin	e6b8094f94	[rocm-libraries] ROCm/rocm-libraries#5921 (commit 032ac1b) [CK] fix clang lifetimebound errors with staging compiler (#5921) ## Motivation The ROCm staging compiler (newer Clang) enforces `[[clang::lifetimebound]]` annotations on methods that return references or pointers to internal object data. Without these annotations, the staging compiler emits compilation errors for container accessor methods across the CK and CK Tile namespaces. ## Technical Details Adds `[[clang::lifetimebound]]` to all reference/pointer-returning accessors in core container types: `ck::` namespace: - `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()` - `index_array` -- `operator[]` - `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()` - `IndexLookupTable` -- `operator[]` `ck_tile::` namespace: - `array` -- `get(i)`, `at()`, `operator[]`, `operator()` - `static_array` -- `operator[]` - `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()` - `make_kernel()` -- parameter pack Also removes the unused `instance_index` variable from `batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing accordingly. ## Test Plan - Compile with the staging compiler to verify all lifetimebound errors are resolved - Existing tests pass unchanged -- the attribute is a compile-time annotation with no runtime effect ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 14:20:20 +00:00
Hosang Yoon	2dcae9d173	[rocm-libraries] ROCm/rocm-libraries#5977 (commit 794bea7) [CK_TILE] Fix Windows build in FMHA head grouping ## Motivation This is a follow-up fix for [PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018). [PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018) added LLC-aware FMHA head grouping / head-major scheduling on RDNA, but it also introduced Linux-only code paths, including `<dirent.h>`, which break Windows builds. This change fixes that by guarding the Linux-specific LLC probing logic so non-Linux platforms can still build correctly. ## Technical Details - Guard `<dirent.h>` with `#ifdef __linux__` - Guard KFD sysfs traversal logic with `#if defined(__linux__)` - On non-Linux platforms, return `0` from `get_kfd_sysfs_llc_cache_bytes()` - Preserve existing fallback behavior through: - `CK_TILE_FMHA_LLC_CACHE_MB` - arch-based default LLC sizes - no head grouping when no LLC size can be resolved ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 14:19:19 +00:00
Jeff Huang	7968368d92	[rocm-libraries] ROCm/rocm-libraries#5918 (commit a7e2c67) [CK][CK_TILE] Add fp8bf16 hdim=256 tile for batch prefill (#5918) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation FP8 batch prefill kernels currently only support head_dim=128. Models with head_dim=256 hit the "invalid argument for batch_prefill" error because no matching kernel variant exists in the codegen dispatch. ## Technical Details Add a hdim=256 tile size entry for fp8bf16 in the batch prefill codegen recipe (`fmha_batch_prefill.py`). Tile configuration: `FmhaFwdTileSize(128, 128, 32, 256, 32, 256, 4,1,1, 4,1,1, 32,32,32, 32,32,32, -1)` - bm0=128, bn0=128 (Q/K tile sizes) - bn1=256, bk0max=256 (V head_dim=256) - Warp MFMA 32x32x32 (fp8 MFMA instructions) This mirrors the existing bf16/fp16 hdim=256 tile but uses fp8 warp sizes. ## Test Plan Tested on both MI308X (gfx942) and MI355X (gfx950) via aiter batch prefill test with the following matrix: - page_size: {1, 16, 1024} - kv_layout: {linear, vectorized} - lookup_table: {sglang, vllm} - causal: {true, false} - logits_soft_cap: {0.0, 30.0} - contiguous_kv: {true, false} ## Test Result MI308X (gfx942): 160 passed, 32 skipped (page_size=1 + vectorized not applicable) MI355X (gfx950): 120 passed, 72 skipped (pre-existing ROCm 7.2 compiler issue with causal + no softcap) No register spills on either platform. 
### Profiling — MI355X (gfx950), FP8 pertensor, hdim=256, seqlen=1024, 8 heads \| page_sz \| kv_layout \| table \| causal \| soft_cap \| time_us \| TFLOPS \| \|---------\|-----------\|-------\|--------\|----------\|---------\|--------\| \| 1 \| linear \| sglang \| False \| 0.00 \| 55.01 \| 156.16 \| \| 1 \| linear \| vllm \| False \| 0.00 \| 55.12 \| 155.84 \| \| 1 \| linear \| sglang \| False \| 30.00 \| 62.63 \| 137.16 \| \| 1 \| linear \| vllm \| False \| 30.00 \| 62.16 \| 138.20 \| \| 1 \| linear \| sglang \| True \| 30.00 \| 64.09 \| 67.01 \| \| 1 \| linear \| vllm \| True \| 30.00 \| 63.85 \| 67.27 \| \| 16 \| linear \| sglang \| False \| 0.00 \| 57.00 \| 150.69 \| \| 16 \| vectorized \| sglang \| False \| 0.00 \| 57.55 \| 149.25 \| \| 16 \| linear \| vllm \| False \| 0.00 \| 56.80 \| 151.23 \| \| 16 \| vectorized \| vllm \| False \| 0.00 \| 57.32 \| 149.87 \| \| 16 \| linear \| sglang \| False \| 30.00 \| 64.77 \| 132.62 \| \| 16 \| vectorized \| vllm \| False \| 30.00 \| 63.54 \| 135.18 \| \| 16 \| linear \| sglang \| True \| 30.00 \| 66.84 \| 64.26 \| \| 16 \| vectorized \| vllm \| True \| 30.00 \| 66.12 \| 64.96 \| \| 1024 \| linear \| sglang \| False \| 0.00 \| 58.25 \| 147.46 \| \| 1024 \| vectorized \| sglang \| False \| 0.00 \| 57.53 \| 149.31 \| \| 1024 \| linear \| vllm \| False \| 0.00 \| 58.06 \| 147.94 \| \| 1024 \| vectorized \| vllm \| False \| 0.00 \| 57.55 \| 149.27 \| \| 1024 \| linear \| sglang \| False \| 30.00 \| 65.38 \| 131.38 \| \| 1024 \| vectorized \| vllm \| False \| 30.00 \| 63.64 \| 134.98 \| \| 1024 \| linear \| sglang \| True \| 30.00 \| 66.85 \| 64.25 \| \| 1024 \| vectorized \| vllm \| True \| 30.00 \| 65.26 \| 65.81 \| ### Profiling — MI308X (gfx942), FP8 pertensor, hdim=256, seqlen=1024, 8 heads \| page_sz \| kv_layout \| table \| causal \| soft_cap \| time_us \| TFLOPS \| \|---------\|-----------\|-------\|--------\|----------\|---------\|--------\| \| 1 \| linear \| sglang \| False \| 0.00 \| 110.18 \| 77.96 \| \| 1 \| 
linear \| vllm \| True \| 30.00 \| 134.33 \| 31.97 \| \| 1 \| linear \| sglang \| True \| 30.00 \| 134.59 \| 31.91 \| \| 16 \| linear \| sglang \| False \| 0.00 \| 115.43 \| 74.42 \| \| 16 \| vectorized \| sglang \| False \| 0.00 \| 106.11 \| 80.95 \| \| 16 \| linear \| vllm \| False \| 0.00 \| 116.34 \| 73.83 \| \| 16 \| vectorized \| vllm \| False \| 0.00 \| 106.17 \| 80.91 \| \| 16 \| linear \| sglang \| False \| 30.00 \| 135.61 \| 63.34 \| \| 16 \| vectorized \| vllm \| False \| 30.00 \| 122.37 \| 70.20 \| \| 16 \| linear \| sglang \| True \| 0.00 \| 117.44 \| 36.57 \| \| 16 \| vectorized \| vllm \| True \| 0.00 \| 108.81 \| 39.47 \| \| 16 \| linear \| sglang \| True \| 30.00 \| 139.43 \| 30.80 \| \| 16 \| vectorized \| vllm \| True \| 30.00 \| 125.87 \| 34.12 \| \| 1024 \| linear \| sglang \| False \| 0.00 \| 110.65 \| 77.63 \| \| 1024 \| vectorized \| sglang \| False \| 0.00 \| 101.70 \| 84.46 \| \| 1024 \| linear \| vllm \| False \| 0.00 \| 111.71 \| 76.89 \| \| 1024 \| vectorized \| vllm \| False \| 0.00 \| 101.55 \| 84.59 \| \| 1024 \| linear \| sglang \| False \| 30.00 \| 129.33 \| 66.42 \| \| 1024 \| vectorized \| vllm \| False \| 30.00 \| 120.95 \| 71.02 \| \| 1024 \| linear \| sglang \| True \| 0.00 \| 112.26 \| 38.26 \| \| 1024 \| vectorized \| vllm \| True \| 0.00 \| 103.02 \| 41.69 \| \| 1024 \| linear \| sglang \| True \| 30.00 \| 133.73 \| 32.12 \| \| 1024 \| vectorized \| vllm \| True \| 30.00 \| 124.75 \| 34.43 \| ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 10:21:33 +00:00
Yi DING	fb64a4453c	[rocm-libraries] ROCm/rocm-libraries#5915 (commit a72cf7d) [CK_TILE] Fix FMHA BWD register pressure by wrapping num_total_loop with amd_wave_read_first_lane (#5915) ## Motivation In three FMHA backward pipelines, `num_total_loop` is computed without `amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even though it is logically uniform across all lanes. This raises register pressure, and under high pressure the compiler may reuse VGPRs across overlapping live ranges. This was confirmed via assembly inspection: the compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an intermediate value for dV, producing incorrect dK/dV gradients. ## Technical Details Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three pipelines: - `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr` - `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp` - `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr` This promotes `num_total_loop` to an SGPR, eliminating the excess register pressure and the incorrect VGPR reuse. ## Test Plan ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 01:45:16 +00:00
Jan Patrick Lehr	b6bbada9f1	[rocm-libraries] ROCm/rocm-libraries#5639 (commit a65e645) [CK] More lifetime-warning suppression ## Motivation The staging compiler picked up another change from upstream that leads to more lifetime-analysis warnings. This breaks the build, given CK is built with -Werror. As a result, compiler promotion is blocked. ## Technical Details This patch adds the pragma push diagnostics to ignore the lifetime-warnings in the modified files to unblock compiler promotion. ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-28 11:20:51 +00:00
Linjun-AMD	3b55a05e71	[rocm-libraries] ROCm/rocm-libraries#5849 (commit d9b89b2) [CK_TILE] Revert "[CK_TILE] Enable MXFP6 for MX GEMM op (#5095)" (#5849) This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637. ## Motivation ## Technical Details ## Test Plan ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 20:37:23 +00:00
Bartłomiej Kocot	c28d0033d7	[rocm-libraries] ROCm/rocm-libraries#5785 (commit d8ecfc1) [CK] Fix min k_batch calculation in conv kernels ## Motivation Avoid division by 0 and remove the unneeded "-1". ## Technical Details Our div-up implementation already returns the lower value (the exact quotient) when the input is divisible, so there is no need to subtract 1. ## Test Plan test_grouped_conv_bwd_weight ## Test Result Passed locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1019	2026-03-27 15:38:21 +00:00
Illia Silin	4c926497ad	[rocm-libraries] ROCm/rocm-libraries#5829 (commit 19b2813) [CK] Fix error in dockerfile when building staging compiler. (#5829) ## Motivation ## Technical Details ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 15:37:21 +00:00
Johannes Graner	58475d3f45	[rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649) [CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393) ## Motivation Add StreamK work distribution to the CK Tile grouped convolution backward weight kernel. Split-K divides the K-dimension uniformly across a fixed `k_batch`, which causes load imbalance when the number of output tiles doesn't evenly fill the GPU. StreamK distributes total K-iterations evenly across workgroups, improving utilization on these shapes. ## Technical Details StreamK is added as an `if constexpr` branch in the existing kernel, selected by the `TilePartitioner_` template parameter. Two reduction strategies are supported: - Linear: tile-starter sequentially accumulates partials from contributing CTAs - Tree: pairwise binary tree reduction (O(log n) depth, faster for many contributors) Both persistent and non-persistent data-parallel (DP) sections are supported. Key changes: - `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution path with `RunStreamK`/`RunStreamKLoop`, partial store/load via workspace, flag-based cross-CTA synchronization, `GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions - `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers) and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by both GEMM and Conv StreamK kernels - `streamk_gemm_kernel.hpp`: Refactored to use shared helpers - Merged split-K and StreamK example invokers via `PartitionerPolicy` template parameter - StreamK example binary with `--streamk_reduction=linear\|tree` and `--streamk_persistent=0\|1` - CK Builder integration: `SpecifiesStreamK` concept, `TilePartitionerType` factory helper, `InstanceTraits` with StreamK fields - 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP), negative, builder regression ### Performance (MI355X, gfx950) Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}): \| Shape \| 16x64 tiles \| \| 
128x128 tiles \| \| \|---\|---\|---\|---\|---\| \| \| Split-K \| StreamK \| Split-K \| StreamK \| \| 1x1 128x128 N=32 28x28 \| 1.00x \| 0.54x \| 1.00x \| 0.81x \| \| 3x3 128x128 N=32 14x14 \| 1.00x \| 0.59x \| 1.00x \| 0.62x \| \| 1x1 256x64 N=32 56x56 \| 1.00x \| 0.83x \| 1.00x \| 1.83x \| \| 3x3 512x512 N=2 7x7 \| 1.00x \| 1.12x \| 1.00x \| 0.62x \| \| 1x1 1024x1024 N=4 7x7 \| 1.00x \| 1.09x \| 1.00x \| 0.60x \| \| 3x3 128x128 N=32 28x28 \| 1.00x \| 0.44x \| 1.00x \| 0.96x \| \| 3x3 256x256 N=32 14x14 \| 1.00x \| 0.67x \| 1.00x \| 0.93x \| \| 3x3 512x512 N=32 7x7 \| 1.00x \| 0.98x \| 1.00x \| 1.16x \| StreamK's value depends on tile config: with larger tiles (fewer output tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up to 1.16x on typical large-channel convolutions. Tree reduction consistently outperforms Linear when multiple CTAs contribute to the same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n) sequential accumulation. The table reports the best of Linear and Tree for each shape. ## Test Plan ```bash ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk ./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk # Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON) ninja -C build check-builder ``` 30 tests covering: - Host-side: type traits, kernel args construction, grid size, workspace size - GPU end-to-end (Linear + Tree): small/medium shapes, multi-group, stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher occupancy - Persistent DP: Linear + Tree with persistent data-parallel dispatch - Negative: `IsSupportedArgument` rejects unaligned K and C - Builder: Create (instance string validation) + Execution (reference comparison) + instance string regression ## Test Result All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK tests pass. Full `check-builder` suite passes. Tolerances computed dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware). 
## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 09:18:14 +00:00
arai713	36f2ec23f5	[rocm-libraries] ROCm/rocm-libraries#5445 (commit 2cdbf8b) [CK_TILE] Support for CompV4 pipeline in Stream-K GEMM (#5445) ## Motivation This PR is extending the pipeline support for Stream-K GEMM by adding the CompV4 pipeline. Additional pipelines will be added in subsequent PRs. ## Technical Details - Enable the CompV4 pipeline by adding an option to set DoubleSMemBuffer to true if the CompV4 pipeline has been selected as it requires double buffered shared memory - Addition of CompV4 pipeline into the extended tests: kernel instances mirror the existing CompV3/Mem configurations (same layout permutations, data types, and tile sizes) with the pipeline type set to CompV4. - Addition of CompV4 pipeline into smoke tests (generated using Tile Engine) ## Test Plan These were tested using the existing smoke and extended tests. ## Test Result All tests passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 08:13:27 +00:00
Yi DING	47a04fda08	[rocm-libraries] ROCm/rocm-libraries#5790 (commit c132b5a) [CK_TILE] Fix NaN for FMHA BWD When seq_q=0 ## Motivation This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the effective query sequence length for a tile is zero, by ensuring the per-tile pipelines exit early with zeroed accumulators and by avoiding an early kernel return that prevented writing out cleared gradients. ## Technical Details - Add unconditional early-exit in the dK/dV pipelines when `num_total_loop <= 0` (no work), returning zeroed accumulators. - Adjust group-mode kernel early-return logic to only return when both `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and store cleared dK/dV when `seqlen_q == 0`. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 07:54:53 +00:00
Yaswanth Raparti	e2470e837a	[rocm-libraries] ROCm/rocm-libraries#5880 (commit a6b6c05) [CK][CK_TILE] Fix CTest parsing to handle all test number formats (#5880) ## Motivation Fix a bug in the smart-build --ctest-only filter that was incorrectly excluding tests with numbers less than 100. ## Technical Details CTest pads test numbers with a variable number of spaces, depending on the digit count: - `Test   #1: name` (3 spaces for tests 1-9) - `Test  #79: name` (2 spaces for tests 10-99) - `Test #100: name` (1 space for tests 100+) The previous code used `line.strip().startswith("Test #")`, which only matched lines with a single space (i.e., test numbers >= 100). As a result, tests such as ck_tile_unit_sequence (Test #79) were excluded from smart-build test selection, and CTest failed because the binary had not been built. Solution: replace the string match with a regex pattern that handles all spacing variations: `r'^\s*Test\s+#\d+:\s*(.+)$'` ## Test Plan Tested with test numbers from 1 to 12345. ## Test Result - Before: 48 tests selected (only tests #100+) - After: 146 tests selected (all CTest-registered tests) ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-03-27 06:34:12 +00:00
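The difference between the two matching strategies in the commit above can be reproduced with a short sketch. It assumes the pattern is meant to allow optional leading whitespace and optional whitespace after the colon (the `\s*` quantifiers); `parse_test_names`, `old_filter`, and the sample listing are hypothetical and are not taken from the smart-build script.

```python
import re

# Assumed form of the pattern: tolerant of any padding before '#'.
TEST_LINE = re.compile(r"^\s*Test\s+#\d+:\s*(.+)$")


def parse_test_names(ctest_output):
    """Return the test names from a CTest-style listing, whatever the padding."""
    names = []
    for line in ctest_output.splitlines():
        match = TEST_LINE.match(line)
        if match:
            names.append(match.group(1).strip())
    return names


def old_filter(ctest_output):
    """The previous approach: only lines with exactly one space before '#'."""
    return [
        line.split(":", 1)[1].strip()
        for line in ctest_output.splitlines()
        if line.strip().startswith("Test #")
    ]


# Illustrative listing with 3, 2, and 1 spaces before '#'.
SAMPLE = (
    "  Test   #1: first_test\n"
    "  Test  #79: ck_tile_unit_sequence\n"
    "  Test #100: hundredth_test\n"
    "Total Tests: 3\n"
)

if __name__ == "__main__":
    print(old_filter(SAMPLE))
    print(parse_test_names(SAMPLE))
```

On this sample the prefix check keeps only the three-digit test, while the regex keeps all three and still ignores the summary line.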
Illia Silin	2f98c7bbef	[rocm-libraries] ROCm/rocm-libraries#5891 (commit 82563ff) fix AITER docker setup ## Motivation Add a new python package required to build AITER. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 04:36:16 +00:00
Bartłomiej Kocot	1c95ce0668	[rocm-libraries] ROCm/rocm-libraries#5856 (commit 2d9a0a1) [CK] Fix unused param mask ## Motivation Compiler error caused by unused param mask. ## Technical Details Skip tests using param mask in test loop. ## Test Plan Current test improvements. ## Test Result Passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 03:58:37 +00:00
dependabot[bot]	6215bb8dbc	[rocm-libraries] ROCm/rocm-libraries#5896 (commit b7436b5) Bump requests from 2.32.5 to 2.33.0 in /projects/composablekernel/docs/sphinx (#5896) Bumps [requests](https://github.com/psf/requests) from 2.32.5 to 2.33.0. ## Release notes Sourced from [requests's releases](https://github.com/psf/requests/releases). ## v2.33.0 (2026-03-25) **Announcements** - 📣 Requests is adding inline types. If you have a typed code base that uses Requests, please take a look at [#7271](https://redirect.github.com/psf/requests/issues/7271). Give it a try, and report any gaps or feedback you may have in the issue. 📣 **Security** - CVE-2026-25645 `requests.utils.extract_zipped_paths` now extracts contents to a non-deterministic location to prevent malicious file replacement. This does not affect default usage of Requests, only applications calling the utility function directly. **Improvements** - Migrated to a PEP 517 build system using setuptools. ([#7012](https://redirect.github.com/psf/requests/issues/7012)) **Bugfixes** - Fixed an issue where an empty netrc entry could cause malformed authentication to be applied to Requests on Python 3.11+. ([#7205](https://redirect.github.com/psf/requests/issues/7205)) **Deprecations** - Dropped support for Python 3.9 following its end of support. ([#7196](https://redirect.github.com/psf/requests/issues/7196)) **Documentation** - Various typo fixes and doc improvements. ## New Contributors - `@M0d3v1` made their first contribution in [psf/requests#6865](https://redirect.github.com/psf/requests/pull/6865) - `@aminvakil` made their first contribution in [psf/requests#7220](https://redirect.github.com/psf/requests/pull/7220) - `@E8Price` made their first contribution in [psf/requests#6960](https://redirect.github.com/psf/requests/pull/6960) - `@mitre88` made their first contribution in [psf/requests#7244](https://redirect.github.com/psf/requests/pull/7244) - `@magsen` made their first contribution in [psf/requests#6553](https://redirect.github.com/psf/requests/pull/6553) - `@Rohan5commit` made their first contribution in [psf/requests#7227](https://redirect.github.com/psf/requests/pull/7227) **Full Changelog**: https://github.com/psf/requests/blob/main/HISTORY.md#2330-2026-03-25 ## Changelog Sourced from [requests's changelog](https://github.com/psf/requests/blob/main/HISTORY.md). The 2.33.0 (2026-03-25) entry lists the same Announcements, Security, Improvements, Bugfixes, Deprecations, and Documentation items as the release notes above. ## Commits - `bc04dfd` v2.33.0 - `66d21cb` Merge commit from fork - `8b9bc8f` Move badges to top of README ([#7293](https://redirect.github.com/psf/requests/issues/7293)) - `e331a28` Remove unused extraction call ([#7292](https://redirect.github.com/psf/requests/issues/7292)) - `753fd08` docs: fix FAQ grammar in httplib2 example - `774a0b8` docs(socks): same block as other sections - `9c72a41` Bump github/codeql-action from 4.33.0 to 4.34.1 - `ebf7190` Bump github/codeql-action from 4.32.0 to 4.33.0 - `0e4ae38` docs: exclude Response.is_permanent_redirect from API docs ([#7244](https://redirect.github.com/psf/requests/issues/7244)) - `d568f47` docs: clarify Quickstart POST example ([#6960](https://redirect.github.com/psf/requests/issues/6960)) - Additional commits viewable in [compare view](https://github.com/psf/requests/compare/v2.32.5...v2.33.0) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.	2026-03-26 22:01:37 +00:00
joyeamd	046d3ac274	[rocm-libraries] ROCm/rocm-libraries#5789 (commit 6654ca6) [CK][CK_TILE] Revert additional oob check in gemm IsSupported function (#5789) ## Motivation Fix ck_tile's oob check. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-26 01:41:35 +00:00
Estevan Vedovelli	0004a37de5	[rocm-libraries] ROCm/rocm-libraries#5675 (commit fbd7fa7) [CK] Properly build HIPTENSOR_REQ_LIBS_ONLY targets when used in addition to MIOPEN_REQ_LIBS_ONLY (#5675) ## Motivation When building CK with both -DHIPTENSOR_REQ_LIBS_ONLY=ON and -DMIOPEN_REQ_LIBS_ONLY=ON, only the MIOpen targets were properly installed. This change is necessary to allow hipTensor to build with TheRock without rebuilding CK from source. ## Technical Details The solution is to consider both HIPTENSOR_REQ_LIBS_ONLY and MIOPEN_REQ_LIBS_ONLY when including hipTensor's targets in CMake, following the same approach used for the conv target (for MIOpen). ## Test Plan Manually test the build and installation with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` alone and with both `-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON`, and verify that the proper files are installed. ## Test Result The build with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` properly includes the targets contraction, reduction and other, while `-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON` includes conv, contraction, reduction and other. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-25 23:59:53 +00:00
Illia Silin	86ec92f925	[rocm-libraries] ROCm/rocm-libraries#5571 (commit 8f60932) [CK] fix clang lifetime bound error in ck_builder. ## Motivation This resolves the compilation error with latest develop compiler branch. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-25 16:45:38 +00:00
Illia Silin	bee61860c2	[rocm-libraries] ROCm/rocm-libraries#5764 (commit f3c1232) Re-enable daily builds with staging compiler ## Motivation This should help us catch and fix any new compilation issues early on. ## Technical Details We now have three compiler profiles: * develop: slightly stabilized version of amd-staging with some of the obvious offending PRs reverted, 1-2 weeks behind amd-staging; * amd-mainline: more stable version of compiler, the baseline for all other branches, e.g., release, npi, etc. 2-4 weeks behind amd-staging. * amd-staging: latest compiler version where all new PRs land, often broken; ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: kensclin <lshyhchy@amd.com>	2026-03-25 16:37:58 +00:00
Ville Pietilä	ec2dbfbfde	[rocm-libraries] ROCm/rocm-libraries#5516 (commit ff3afda) [CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler (#5516) ## Motivation We want to close the performance gap between old CK and CK Tile for bwd data convolutions. To achieve this, we need two things: - Configurations for the old CK kernel instances, so that we can map them to CK Tile instances. - Support in the CK profiler for running a CK Tile instance with the same API as the old CK instances. ## Technical Details Extracted kernel configurations from old CK. The codegen Python script for CK Tile convs is extended to also support bwd data. The generated instances are added to the CMake build (target `device_grouped_conv_bwd_data_tile_instances`). A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the CK Profiler. The API is the same as for old CK's profiler op `grouped_conv_bwd_data`.	2026-03-25 14:36:11 +00:00
joyeamd	1834e318da	[rocm-libraries] ROCm/rocm-libraries#5697 (commit dd1c396) Revert "Ck/joye/revert oob check (#5640)" This reverts commit 552ab4880292694cb8261f40fa4223af52cb8419. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-23 22:05:30 +00:00
andrew clark	5a4243096b	[rocm-libraries] ROCm/rocm-libraries#5713 (commit e179279) Adding New Notification Detection ## Motivation Restrict one of the notification failure patterns to match a specific missing-drivers log pattern. This will help reduce the noise from erroneous log matches. Also add a new failure pattern to notify us of GitHub access issues. ## Technical Details - Set the failure pattern to match the exact failure observed in the logs. - Switched to a plain substring search so special characters are handled literally. - Added a new failure pattern for GitHub access errors. ## Test Plan - Force a failure using the known failure patterns. ## Test Result The forced failures were triggered and caught by the notification system. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-23 20:57:55 +00:00
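Why a plain substring search handles special characters literally can be shown with a small sketch. The log line and the failure pattern below are made up for illustration and are not the patterns used by the CI notification code, which this commit log does not show.

```python
import re

# Hypothetical log line and failure pattern containing regex metacharacters.
LOG_LINE = "error: driver (amdgpu) not found [code=1]"
NEEDLE = "(amdgpu) not found [code=1]"

# Plain substring search: every character is taken literally.
substring_hit = NEEDLE in LOG_LINE

# Regex search with the raw pattern: '(...)' becomes a group and '[...]'
# a character class, so the pattern no longer describes the literal text.
regex_hit = re.search(NEEDLE, LOG_LINE) is not None

# Regex search after escaping behaves like the substring search again.
escaped_hit = re.search(re.escape(NEEDLE), LOG_LINE) is not None

if __name__ == "__main__":
    print(substring_hit, regex_hit, escaped_hit)
```

The substring and escaped-regex checks find the failure, while the unescaped regex misses it, which is the kind of mismatch a literal search avoids.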
Eiden Yoshida	ba2fb0224f	[rocm-libraries] ROCm/rocm-libraries#5691 (commit 2fbb1fc) [CK] MICI: Revert "add self healing to ref repo" The check may not be working as intended, causing premature deletion of reference repositories	2026-03-23 14:16:53 +00:00
Bartłomiej Kocot	f79926009b	[rocm-libraries] ROCm/rocm-libraries#5555 (commit 1d2c4c8) [CK][CK Tile] Fix kbatch check in grouped conv and gemm kernels (#5555) ## Motivation Fix kbatch check in grouped conv and gemm kernels, allow tails for kbatch. ## Technical Details Round up K / Kperxdl and divide it by Kbatch to allow tail for K. ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-21 22:56:19 +00:00
Emily Martins	6b69ac9676	[rocm-libraries] ROCm/rocm-libraries#5625 (commit 7d2ed43) [CK_TILE] Prune Stream-K Tile Engine Tests ## Motivation Stream-K tile engine tests are causing issues for build time. While we work on a more permanent solution, these changes prune the Stream-K test instances to help reduce the build time burden. ## Technical Details The Stream-K team recently transitioned to using CK Tile's tile engine infrastructure for our smoke tests. However, since tile engine creates an individual target per kernel instance, we've found that the tile engine tests are increasing build times. Our team is currently working to convert our existing tile engine tests back to basic gtests. While this work takes place, we are temporarily pruning the existing Stream-K tile engine test instances to help reduce the build time burden. ## Test Plan Ran the pruned test set on all gfx90a, gfx942, and gfx950. ## Test Result All tests pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 20:31:39 +00:00
andrew clark	a66047ad09	[rocm-libraries] ROCm/rocm-libraries#5464 (commit debfc96) Improved CI infrastructure failure detection ## Motivation This PR re-enables CI infrastructure failure detection and notification, which had been disabled due to performance issues caused by loading large build logs (~80k lines) into memory for pattern scanning. The goal is to reliably detect known infrastructure failures (GPU errors, Docker authentication issues, disk space errors, etc.) and send actionable Teams notifications without hanging on large logs. ## Technical Details - Replaced full build log loading and Groovy-based pattern scanning with a streaming `wget | grep -E` pipe. grep scans natively, so the full log is never loaded into Groovy, which resolves the hang on large logs. - Combined all failure patterns into a single `grep -E` call to avoid multiple log fetches. - The node name is now tracked with the observed failure. - Added a new failure pattern for devices running out of space. ## Test Plan - Forced failures in the "Determine CI Execution" stage with all 9 failure patterns echoed to the build log. - Simulated large log sizes (~80k lines of dummy output) to validate pattern detection and node name extraction at realistic log scales, including patterns placed both before and after large blocks of dummy output. ## Test Result All 9 failure patterns detected correctly. Teams notifications sent with accurate log context, node name, and job links. No hangs observed on 80k-line simulated logs. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 19:18:07 +00:00
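The streaming idea in the commit above, one pass over the log with all patterns combined into a single alternation, can be sketched in Python. This is a stand-in for the `wget | grep -E` pipe and not the Jenkins code itself; the pattern strings, `scan_log`, and `fake_log` are hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical failure patterns; the real list is not part of this commit log.
PATTERNS = [
    r"No space left on device",
    r"docker login failed",
    r"GPU not found",
]

# One combined alternation, analogous to a single `grep -E 'a|b|c'` call.
COMBINED = re.compile("|".join(f"(?:{p})" for p in PATTERNS))


def scan_log(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line number, text) for matching lines, one line at a time."""
    for lineno, line in enumerate(lines, start=1):
        if COMBINED.search(line):
            yield lineno, line.rstrip("\n")


def fake_log(n: int) -> Iterator[str]:
    """Generate n lines lazily, with one failure buried in the middle."""
    for i in range(n):
        if i == n // 2:
            yield "write error: No space left on device\n"
        else:
            yield f"dummy output {i}\n"


if __name__ == "__main__":
    for hit in scan_log(fake_log(80_000)):
        print(hit)
```

Because `scan_log` consumes an iterator, the log is never held in memory as a whole, which mirrors the reason the commit gives for moving the scan out of Groovy.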

3214 Commits