composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-12 01:10:17 +00:00

Author	SHA1	Message	Date
alexxu-amd	59f8535bf9	[rocm-libraries] ROCm/rocm-libraries#6326 (commit c1b6c3e) Correct .readthedocs.yml file path ## Motivation The Read the Docs config files contain outdated file paths from their legacy repos. Update and correct all paths. ## Technical Details ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-09 20:30:41 +00:00
Vidyasagar Ananthan	920acd2c12	[rocm-libraries] ROCm/rocm-libraries#5168 (commit 8b5afcb) [CK] [CK_Tile] Add GroupConv to Kernel Dispatcher ## Motivation This PR adds CK Tile group convolution (forward, backward-data, backward-weight) support to the kernel dispatcher, matching and unifying with the existing dispatcher GEMM infrastructure in architecture and usability. The dispatcher provides a unified kernel dispatch system with both C++ and Python frontends, and until now only supported GEMM operations. This PR enables framework integrators to use the same declarative kernel workflow for convolutions as they do for GEMM: declare kernels, build a registry JIT, select kernels within the registry at runtime, and dispatch to GPU. Future PRs will include runtime kernel selection heuristics for autotuning of kernel parameters based on (problem, hardware arch). ## Technical Details Grouped convolution support has been added to the CK Tile Dispatcher with generated_conv_backend.hpp enabling dispatcher.run(in, wei, out, problem) for all 6 conv variants (fwd/bwdd/bwdw x 2D/3D), runtime heuristic kernel selection, and GroupedConvKernelKey with full ConvConfigBase fields. The Python side adds parallel JIT via registry.build(max_workers) and heuristic registry.select(). Includes 7 C++ and 6 Python examples covering all directions with CPU reference validation, and shared infrastructure improvements (BaseRegistry CRTP, structured exceptions). As a sanity check, JIT compile times for a single kernel remain the same, and for multiple kernels there is better parallelism (kernels / 1 worker / 8 workers): 1 / 7.7 s / 7.7 s; 2 / 15.9 s / 8.2 s; 4 / 33.4 s / 9.7 s; 6 / 52.3 s / 10.2 s. ## Test Plan 145 ephemeral unit tests have been added to test basic functionality. All 30 examples/integration tests run end-to-end on gfx950 (MI350): 7 C++ conv, 7 C++ GEMM, 6 Python conv, 10 Python GEMM. 
CPU reference validation for forward, backward-data, and backward-weight (2D) in both C++ and Python examples pass. ## Test Result 30 examples pass. Peak performance: 132 TFLOPS (Batch-32 forward 56x56), 53 TFLOPS (pointwise 1x1). CPU reference accuracy: max_abs_diff < 0.002 for all directions (fp16 vs fp32 reference). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-09 17:39:35 +00:00
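The parallel JIT build described in the entry above can be pictured with a small worker-pool sketch. This is not the dispatcher's implementation: `compile_kernel`, `build_all`, and the kernel names are hypothetical stand-ins, and only the idea of a `max_workers` knob is taken from the commit message.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_kernel(name):
    # Hypothetical stand-in for one JIT compile; a real build would
    # invoke the compiler here and return the path of the artifact.
    return f"{name}.so"


def build_all(kernel_names, max_workers=1):
    """Compile every kernel with up to max_workers concurrent workers.

    pool.map preserves input order, so results line up with kernel_names.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compile_kernel, kernel_names))


# Hypothetical kernel names, for illustration only.
kernels = ["conv_fwd_2d", "conv_bwdd_2d", "conv_bwdw_2d"]
artifacts = build_all(kernels, max_workers=8)
```

With independent compile jobs, wall time is bounded below by the slowest single job, which is consistent with the reported single-kernel time staying at 7.7 s regardless of worker count.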
Hosang Yoon	4c0e73ab12	[rocm-libraries] ROCm/rocm-libraries#6156 (commit 367565a) [CK_TILE] Optimize FMHA head-dim padded path on gfx11/gfx12 (#6156) ## Motivation On gfx11/gfx12, FMHA forward kernels that require head-dim padding show a large performance drop compared to the exact-head-dim path. In practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too far off the fast path. This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while keeping the behavior for other GPUs unchanged. ## Technical Details - Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path for gfx11/gfx12. - For `receipt=0`, keep support conservative and only enable the padded fast path for vector-safe cases (`head_dim % 8 == 0`), matching the existing assumption used on other GPUs. - Move `v_prefetch` later only for the head-dim-padded path on gfx11/gfx12. This reduces live ranges and removes the register-spill behavior seen in the earlier scheduling. - Enable the buffer-load OOB check offset trick for the padded path on gfx11/gfx12. ## Test Plan ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result Observed padded-head-dim performance improvements for HDIM=72/80: - gfx11: about ~3.5x - gfx1151: about ~2.0x - gfx12: about ~1.3x ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-08 14:53:18 +00:00
Yaswanth Raparti	7d6c8e5afa	[rocm-libraries] ROCm/rocm-libraries#6215 (commit bb1f765) [CK] [CK Tile] Improved ci_safety_check in smart-build infrastructure (#6215) ## Motivation The two-dot syntax (origin/develop..HEAD) is more conservative and catches a broader set of changes when PRs merge the develop branch. While three-dot syntax shows only PR-specific changes, two-dot ensures we don't miss any files that differ between develop and the PR branch, including files modified in both the PR and merged develop commits. This conservative approach prioritizes catching all potential issues over CI efficiency, which is appropriate for build system change detection. ## Technical Details - Switched to two-dot (..) syntax in ci_safety_check.sh - Update comments to clarify the intentional use of two-dot syntax - Maintain consistency across both CHANGE_ID branches - Trigger a full build when any of the following change: `Dockerfile\|Jenkinsfile\|CMakePresets\.json\|script/dependency-parser/` ## Test Plan Tested with PR 6200, which has multiple merge commits. ## Test Result It detects 43 new tests compared to the three-dot scheme. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-08 09:55:56 +00:00
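For readers unfamiliar with the two forms: `git diff A..B` compares the two branch tips directly, while `git diff A...B` compares the merge base of A and B with B. The toy model below treats each commit as a snapshot (path to content) to show why the two-dot form can report more files. The file names and contents are invented for illustration, and this is not the logic of `ci_safety_check.sh`.

```python
def changed_files(old, new):
    """Paths whose content differs between two snapshots (dict: path -> content)."""
    return {p for p in old.keys() | new.keys() if old.get(p) != new.get(p)}


# Hypothetical snapshots of three commits.
merge_base = {"a.hpp": "v1", "b.hpp": "v1", "CMakeLists.txt": "v1"}
develop    = {"a.hpp": "v1", "b.hpp": "v2", "CMakeLists.txt": "v2"}  # develop moved on
pr_head    = {"a.hpp": "v2", "b.hpp": "v1", "CMakeLists.txt": "v1"}  # PR touched a.hpp only

# Models `git diff origin/develop...HEAD`: merge base vs PR head.
three_dot = changed_files(merge_base, pr_head)
# Models `git diff origin/develop..HEAD`: develop tip vs PR head.
two_dot = changed_files(develop, pr_head)
```

In this scenario the three-dot set contains only `a.hpp`, while the two-dot set also contains the files that changed on develop after the branch point, so a change detector built on it schedules more tests.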
Christopher Millette	a170e2bd9d	[rocm-libraries] ROCm/rocm-libraries#5939 (commit 6fb1791) [CK_TILE] Flatten nested static_for loops into static_ford (#5939) ## Summary Mechanical conversion of 129 nested `static_for`/`static_ford` patterns to flat `static_ford` across 29 ck_tile header files. Each conversion eliminates intermediate lambda closure instantiations by replacing nested compile-time loops with a single flat iteration using index decomposition. ### What `static_ford` eliminates When `static_for` loops are nested, each level creates unique closure types: ```cpp // BEFORE: M + M×N = 20 IR functions (for M=4, N=4) static_for<0, 4, 1>{}([&](auto m) { // 4 closure instantiations static_for<0, 4, 1>{}([&](auto n) { // 4×4 = 16 closure instantiations body(m, n); }); }); // AFTER: M×N = 16 IR functions (with ford_applier, no intermediates) static_ford<sequence<4, 4>>{}([&](auto mn) { constexpr auto m = number<mn[number<0>{}]>{}; constexpr auto n = number<mn[number<1>{}]>{}; body(m, n); }); ``` ### Pattern categories converted \| Category \| Count \| Description \| \|----------\|-------\|-------------\| \| C (2-level `static_for` chains) \| 112 \| Nested `static_for` → `static_ford` \| \| C3 (3-level `static_for` chains) \| 9 \| Three consecutive nests → `static_ford` \| \| Partial rescue \| 3 \| Outer 2 levels of blocked 4-level nests \| \| B (nested `static_ford` merge) \| 5 \| Two nested `static_ford` → single higher-dim `static_ford` \| \| Total \| 129 \| Across 29 files \| 6 false positives were detected and reverted (in `tensor_adaptor.hpp`, `tile_distribution.hpp`, `tile_distribution_encoding.hpp`) where the inner loop bound depended on the outer variable. 
### Files changed by family \| Family \| Files \| Sites \| \|--------\|-------\|-------\| \| Block GEMM \| 12 \| ~20 \| \| FlatMM pipelines \| 4 \| ~69 (including 5 ford-ford merges) \| \| GEMM quant \| 7 \| ~22 \| \| FlatMM kernel \| 1 \| 2 \| \| FMHA \| 1 \| 2 \| \| Reduce/norm \| 2 \| 2 \| \| Epilogue \| 1 \| 1 \| ### Blocked locations from review comments - block_gemm_areg_breg_creg_v1.hpp:356 — BLOCKED: runtime scale loads (`scale_a_slice`, `scale_b_slice`, A warp tensor load) between every nesting level - block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp:228 — BLOCKED: `zero_accumulators()` before inner loop; `sched_barrier` + conditional `block_sync_lds()` after inner loop - block_universal_gemm_as_aquant_bs_bquant_cr.hpp:298 — BLOCKED: runtime `CWarpTensor` construction before inner loop; quantization scale application code after inner loop - block_universal_gemm_as_aquant_bs_cr.hpp:277 — BLOCKED: same pattern as above - block_universal_gemm_as_bs_bquant_cr.hpp:367 — BLOCKED: same pattern as above ## Depends on - #5938 ([CK_TILE] Optimize static_ford and sequence compile-time infrastructure) — provides the `ford_applier` that makes these conversions beneficial. Without it, `static_ford` uses a recursive implementation that provides no IR function savings. ## Results (combined with #5938) ### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942) \| Target \| Base (s) \| Treat (s) \| Delta \| % \| Significant? \| \|--------\|----------\|-----------\|-------\|---\|-------------\| \| flatmm \| 161.1 \| 149.0 \| -12.1s \| -7.5% \| YES (p<0.01, 7/7 wins) \| \| universal_gemm \| 225.4 \| 220.3 \| -5.1s \| -2.3% \| YES (p<0.01, 7/7 wins) \| ### IR Function Counts (device trace, gfx942) \| Target \| InstFunc \| CodeGen \| \|--------\|----------\|---------\| \| universal_gemm \| -8.5% \| -9.2% \| \| flatmm \| -7.6% \| -10.5% \| ### ASM Equivalence 5/5 PASS — 650,151 lines verified identical (gfx942). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale. 
## Test plan - [x] ASM equivalence verified (650K lines, gfx942) - [x] Wilcoxon timing verified (7 trials, p<0.01) - [x] IR function counts verified (-7.6% to -10.5% CodeGen reduction) - [ ] CI 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-07 14:38:07 +00:00
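The "index decomposition" behind the flattening can be modeled at runtime in a few lines. The sketch below is a Python model only, not ck_tile code: `strides_for` and `decompose` are illustrative names, and the real `static_ford` does this at compile time over `sequence<...>` types. It checks that a single flat loop visits the same (m, n) pairs, in the same order, as the nested loops it replaces, and reproduces the function counts quoted in the summary.

```python
from itertools import product


def strides_for(lengths):
    """Row-major strides: the last dimension varies fastest."""
    strides = [1] * len(lengths)
    for d in range(len(lengths) - 2, -1, -1):
        strides[d] = strides[d + 1] * lengths[d + 1]
    return strides


def decompose(flat, lengths):
    """Map a flat index to a multi-index using precomputed strides."""
    return tuple((flat // s) % n for s, n in zip(strides_for(lengths), lengths))


M, N = 4, 4
flat_order = [decompose(i, (M, N)) for i in range(M * N)]
nested_order = list(product(range(M), range(N)))  # what the nested loops visit

# Closure counts from the commit message: nested loops instantiate the outer
# lambda M times and the inner lambda M*N times; the flat loop only M*N.
nested_functions = M + M * N
flat_functions = M * N
```

The conversion is only valid when the inner bound does not depend on the outer index, which is why the six false positives mentioned above had to be reverted.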
Po Yen Chen	c2ac7aa7b0	[rocm-libraries] ROCm/rocm-libraries#6051 (commit f0838b2) [CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051) ## Motivation The existing FMHA V3 pipeline only supports fp16/bf16 data types. This PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950, enabling higher throughput for FP8 inference workloads using the assembly-optimized V3 code path. ## Technical Details Warp GEMM: - Add FP8 32x32x32 warp gemm with C-transposed distribution (`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries V3 Kernel (`fmha_fwd_v3_kernel.hpp`): - Add per-tensor descale support for Q, K, V tensors, passing descale pointers through to pipeline kargs V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`): - Add FP8 data path with dtype-aware type selection - Add asm volatile P matrix conversion from f32 to fp8 - Add FP8-aware instruction scheduling in `CoreLoopScheduler` V3 Pipeline Policy (`block_fmha_fwd_v3_pipeline_default_policy.hpp`): - Add FP8 QK warp gemm selection (SwizzleB variant for V tile distribution compatibility) Codegen (`fmha_fwd.py`): - Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128) - Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale: no/pertensor) - Extend `can_dispatch_v3` condition for fp8bf16 + pertensor Misc: - Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum (`arch.hpp`) - Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`) V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`) pending further validation. 
## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950) \| Problem \| Regular (TFlops) \| Varlen (TFlops) \| \|---\|---:\|---:\| \| batch=1 heads=6/1 seqlen=1024 causal \| 48.9 \| 47.6 \| \| batch=1 heads=6/1 seqlen=2048 causal \| 119.8 \| 117.4 \| \| batch=1 heads=6/1 seqlen=4096 causal \| 263.7 \| 259.2 \| \| batch=1 heads=6/1 seqlen=8192 causal \| 548.9 \| 543.6 \| \| batch=1 heads=6/1 seqlen=16384 causal \| 1043.0 \| 1063.7 \| \| batch=1 heads=6/1 seqlen=32768 causal \| 1237.2 \| 1279.6 \| \| batch=1 heads=6/1 seqlen=65536 causal \| 1315.4 \| 1382.7 \| \| batch=1 heads=6/1 seqlen=131072 causal \| 1326.3 \| 1402.2 \| \| batch=1 heads=16/1 seqlen=65536 causal \| 1298.7 \| 1388.4 \| \| batch=1 heads=40/40 seqlen=37200 non-causal \| 1248.9 \| 1326.1 \| ## Test Plan Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40), GQA ratios (1:1, 1:8), and causal/non-causal modes. Verified all cases dispatch to the V3 pipeline by enabling `F_is_v3_enabled` and confirming kernel names contain `qr_async_trload_v3`. ## Test Result 176/176 tests passed with V3 enabled. All cases correctly dispatched to V3 pipeline with `pertensor` quantization. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-07 14:20:43 +00:00
Jeff Huang	020b6f435e	[rocm-libraries] ROCm/rocm-libraries#6201 (commit 5c0697e) [CK_TILE] Temporarily remove batch prefill KV cache overflow asserts (#6201) ## Summary - Temporarily remove the KV cache offset overflow assert checks in `FmhaBatchPrefillWithPagedKVCacheKernel` - The asserts are correct, but they block project progress in certain configurations - This is a temporary workaround to unblock progress; a proper fix will follow ## Note This is NOT a permanent solution. A follow-up PR will add proper overflow handling that addresses the underlying issue without blocking progress.	2026-04-07 12:42:08 +00:00
Vidyasagar Ananthan	a586a1f8bd	[rocm-libraries] ROCm/rocm-libraries#6135 (commit 91f0518) [CK][CK_Tile] Ensure CK Tile engine benchmarking targets are excluded from default build. (#6135) ## Motivation Ensuring that tile engine benchmarking does not build by default and slow other developers. ## Technical Details - Added EXCLUDE_FROM_ALL to all add_subdirectory calls in tile_engine/CMakeLists.txt and ops/gemm/CMakeLists.txt, so none of the tile engine ops targets are part of the default all build. - Added missing EXCLUDE_FROM_ALL to add_executable in ops/pooling/CMakeLists.txt and ops/reduce/CMakeLists.txt (the GEMM variants already had it). - Downgraded message(STATUS ...) to message(VERBOSE ...) (or DEBUG for per-target creation) in ops/pooling/, ops/gemm_streamk/, and ops/reduce/ CMakeLists. The other four GEMM variants (gemm_universal, gemm_multi_d, gemm_preshuffle, grouped_gemm) already used VERBOSE. - Targets can still be built on demand via their aggregate names (e.g. make benchmark_pooling_all, make benchmark_gemm_streamk_all). ## Test Plan Tile engine benchmark testing stage should be unaffected. ## Test Result N/A ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-03 22:08:33 +00:00
harkgill-amd	3fb26ec98c	[rocm-libraries] ROCm/rocm-libraries#5141 (commit e790cc0) Add missing gfx1033 to gfx103 group definition in ck ## Motivation Resolving PyTorch build failures when enabling builds for gfx103X-all family in TheRock. https://github.com/ROCm/TheRock/pull/3763. `gfx1033` is the only failing architecture in the family and the failures point to missing support in CK. ## Technical Details PyTorch build fails with repeated error message ``` /__w/TheRock/TheRock/external-builds/pytorch/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:33:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD' 33 \| wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` `gfx1033` is missing from the `__gfx103__` group which results in `CK_BUFFER_RESOURCE_3RD_DWORD` never being defined for it. Adding in `gfx1033` to the missing files which should be the minimum fix to allow torch builds to pass. ## Test Plan Compile sample test file and target gfx1033 ``` ... #ifdef __HIP_DEVICE_COMPILE__ static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == 0x31014000, "wrong device value"); #else static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == -1, "wrong host value"); #endif ``` ## Test Result Prior to the applying patch, compilation fails with `error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'` After applying patch, test file compiles successfully. ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-03 19:45:41 +00:00
Illia Silin	6880e46a47	[rocm-libraries] ROCm/rocm-libraries#6147 (commit 8035856) [CK] Replace daily CI builds with mainline compiler with TheRock compiler. (#6147) ## Motivation Since the compiler team has deprecated the amd-mainline branch and switched to TheRock, we'll start building a docker image with TheRock artifacts and building/testing CK with that. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-03 17:05:23 +00:00
Thrupti Raj Lakshmana Gowda	cf847f90ed	[rocm-libraries] ROCm/rocm-libraries#6102 (commit 827fd10) [CK Tile] Fix architecture-dependent EightWave assignment in cshuffle_epilogue (#6102) ## Motivation Tile engine CI build on the develop branch started failing after a recent change (https://github.com/ROCm/rocm-libraries/pull/5218) in `cshuffle_epilogue.hpp`. The `EightWave` constant was unconditionally computed as `(MWave * NWave == 8)` for all architectures, but this logic is only valid for gfx9. On other architectures (e.g., gfx1201), `EightWave` must always be `false`, otherwise it leads to incorrect `BlockedXDLN_PerWarp` computation and build failures. ## Technical Details In `cshuffle_epilogue.hpp`, the `EightWave` static constexpr was set as: ```cpp static constexpr bool EightWave = (MWave * NWave == 8); ``` This was applied regardless of the target GPU architecture. The fix uses a preprocessor guard to make this architecture-aware: - gfx9 (`__gfx9__`): `EightWave` is evaluated as `(MWave * NWave == 8)` (true or false depending on the wave configuration) - All other architectures: `EightWave` defaults to `false` ## Test Plan - Tile engine CI build on develop branch ## Test Result - Pending CI ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-03 02:55:45 +00:00
Hosang Yoon	1dc35ff4ae	[rocm-libraries] ROCm/rocm-libraries#6038 (commit d7041a2) [CK_TILE] Restrict FMHA codegen to the kernel subset used by FlashAttention (#6038) ## Motivation Currently, the CK FlashAttention integration generates a broader FMHA kernel set than the FlashAttention wrappers can actually dispatch, which increases compile time without improving runtime coverage. ## Technical Details The FlashAttention CK wrappers do not use all logits/LSE variants emitted by the default FMHA codegen. The direct `fmha_fwd` path always uses softcap-disabled, LSE-enabled kernels, and the `fmha_fwd_splitkv` path only uses softcap-disabled kernels. This change trims codegen to that subset and stops generating the unused logits/LSE variants. This reduces the generated forward kernel set without changing `fmha_fwd_appendkv` or `fmha_bwd`. The reduced kernel set was validated by building and running the [FlashAttention](https://github.com/Dao-AILab/flash-attention) CK backend. Across targets, the total generated FMHA kernel count is reduced by: - `gfx942`: 29.3% - `gfx1100`: 33.7% - `gfx1201`: 31.3% ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> pytest test/test_flash_attn_ck.py from https://github.com/Dao-AILab/flash-attention ## Test Result all tests passed <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-03 00:18:21 +00:00
Christopher Millette	144854dba1	[rocm-libraries] ROCm/rocm-libraries#5938 (commit 73f3650) [CK_TILE] Optimize static_ford and sequence compile-time infrastructure (#5938) ## Problem Each `static_for<0, N, 1>` instantiates its lambda N times (one per `number<I>` type). When nested, intermediate lambdas capture the outer loop variable (a different type per iteration), creating unique closure types. For a 3-level nest with M=4, N=4, K=2, this produces 4 + 16 + 32 = 52 IR functions, of which 20 are intermediate closures that get inlined away but still cost frontend compile time. ck_tile's `static_ford` was supposed to eliminate these intermediates (as old CK's PR #5031 did successfully), but it used a recursive `static_ford_impl` that recreated the same closure pattern plus added `reorder_old_to_new`/`reorder_new_to_old` overhead. Additionally, the sequence utility layer (`sequence_sort`, `is_valid_sequence_map`) used recursive template metaprogramming that generated O(N log N) intermediate types for every permutation validation, which is called on every `reorder_new_to_old`/`reorder_old_to_new` invocation. ## Changes ### 1. Replace `sequence_sort` with constexpr insertion sort Replace recursive merge sort (`sequence_sort_impl` + `sorted_sequence_merge_impl`, O(N log N) intermediate type instantiations) with constexpr insertion sort using `static_array`. O(1) template depth, same `::type` and `::sorted2unsorted_map` API. ### 2. Replace `is_valid_sequence_map` with constexpr check Replace sort-based permutation validation (which instantiated the full `sequence_sort` chain) with a constexpr "seen array" loop. O(N) constexpr steps instead of O(N log N) template instantiations. ### 3. 
Replace recursive `static_ford` with flat-loop `index_decomposer` Replace `static_ford_impl` (recursive `static_for` nesting + `pop_front`/`push_back` + `reorder_old_to_new` per iteration) with flat `index_decomposer` using pre-computed strides. Add `decompose_reordered` alias that folds reordering into decomposition, and `inverse_perm` helper that avoids the `sequence_map_inverse` → `is_valid_sequence_map` → `sequence_sort` chain. ### 4. Eliminate internal lambda via `ford_applier` The flat-loop approach still used `static_for` with a lambda, creating M×N internal lambda instantiations per call site. Replace with `ford_applier` struct that calls `f(decompose<I>{})` directly via fold expression — zero intermediate closures: ```cpp // Before: 2×M×N function instantiations static_for<0, MN, 1>{}([&](auto i) { f(decompose<i>{}); }); // After: M×N function instantiations (50% reduction) ford_applier<Decomposer, make_index_sequence<MN>>{}(f); ``` Also unified identity and non-identity order paths into a single template with `constexpr if`. ### 5. Fix const-qualified sequence handling Fix `is_valid_sequence_map` to handle const-qualified sequence types via `remove_cvref_t` in callers (`tensor_adaptor.hpp`, `tile_distribution_encoding.hpp`). ## Results (this PR only, without flattening) ### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942, load ~5) \| Target \| Base (s) \| Treat (s) \| Delta \| % \| Wins \| Significant? 
\| \|--------\|----------\|-----------\|-------\|---\|------\|-------------\| \| flatmm \| 160.1 \| 152.7 \| -7.4s \| -4.6% \| 6/7 \| YES (W+=1, p<0.05) \| \| universal_gemm \| 228.4 \| 224.7 \| -3.7s \| -1.6% \| 6/7 \| Trending (W+=4) \| Per-trial diffs (flatmm): [-6, -20, -9, -8, -8, 4, -5] Per-trial diffs (universal_gemm): [-2, -6, 4, -3, -2, -11, -6] ### IR Function Counts (device trace, gfx942) \| Target \| Metric \| Before \| After \| Delta \| % \| \|--------\|--------\|--------\|-------\|-------\|---\| \| universal_gemm \| InstantiateFunction \| 117,715 \| 109,165 \| -8,550 \| -7.3% \| \| universal_gemm \| CodeGen Function \| 47,912 \| 45,044 \| -2,868 \| -6.0% \| \| flatmm \| InstantiateFunction \| 100,939 \| 95,127 \| -5,812 \| -5.8% \| \| flatmm \| CodeGen Function \| 42,651 \| 40,367 \| -2,284 \| -5.4% \| Note: The `ford_applier` (commit 3) has minimal additional effect in this PR since ck_tile code does not yet use `static_ford` extensively. Its impact compounds when the follow-up flattening PR #5939 converts 124 `static_for` nests to `static_ford`. Combined results with #5939: flatmm -7.5% wall time (p<0.01), CodeGen -10.5%. ### ASM Equivalence 7/7 PASS — 979,943 lines of device assembly verified identical (gfx942 + gfx1100). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale. ## Test plan - [x] `test_ck_tile_static_ford`: 13 behavioral tests (identity/non-identity orders, 1D-4D, unit dimensions, edge cases) - [x] `ck_tile_unit_sequence`: 88 tests (11 new for sorted2unsorted_map, is_valid_sequence_map edge cases, sequence_unique_sort map round-trip) - [x] ASM equivalence verified (980K lines) - [x] Wilcoxon timing verified (7 trials, flatmm p<0.05) - [ ] CI 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-02 21:25:56 +00:00
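Changes 1 and 2 of this entry replace template recursion with ordinary loops evaluated at compile time. The sketch below models the two algorithms in Python. It is not the ck_tile source: the function names mirror the identifiers mentioned in the commit message, but the bodies, the ascending comparison, and the Python list types are assumptions made for illustration.

```python
def insertion_sort_with_map(values):
    """Insertion sort that also tracks where each element came from.

    Returns (sorted_values, sorted2unsorted) such that
    sorted_values[i] == values[sorted2unsorted[i]].
    """
    out = list(values)
    idx = list(range(len(values)))
    for i in range(1, len(out)):
        v, k = out[i], idx[i]
        j = i - 1
        # Shift larger elements one slot right, carrying their origin index.
        while j >= 0 and out[j] > v:
            out[j + 1], idx[j + 1] = out[j], idx[j]
            j -= 1
        out[j + 1], idx[j + 1] = v, k
    return out, idx


def is_valid_sequence_map(perm):
    """True if perm is a permutation of 0..N-1, using a 'seen' array (O(N) steps)."""
    seen = [False] * len(perm)
    for p in perm:
        if not (0 <= p < len(perm)) or seen[p]:
            return False  # out of range or duplicate
        seen[p] = True
    return True


sorted_vals, sorted2unsorted = insertion_sort_with_map([3, 1, 2])
```

The seen-array check needs no sort at all, which is how the permutation validation avoids pulling in the `sequence_sort` chain.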
Emily Martins	7cc9bae9d2	[rocm-libraries] ROCm/rocm-libraries#5722 (commit 55febd2) [CK Tile] Stream-K gtest Code Gen ## Motivation Stream-K was using the tile engine infrastructure for smoke tests. However, tile engine creates a different target per kernel instance, which has resulted in scalability issues when used in the context of unit tests. To avoid burdens on cmake configuration and build time, we have opted to remove our Stream-K tile engine tests. Instead, we use pure gtests with code gen to generate repetitive .cpp files. Note: This appears to change a lot of files because many files are removed since they are now generated at build time. ## Technical Details We originally used Tile Engine to facilitate code gen for unit tests since we found that pure gtests required the addition of many repetitive .cpp files of the following form: ```cpp #include "test_gemm_streamk_common_includes.hpp" template <typename Tuple> class TestCkTileStreamKBf8 : public TestCkTileStreamK<Tuple> { }; #define TEST_SUITE_NAME TestCkTileStreamKBf8 TYPED_TEST_SUITE(TestCkTileStreamKBf8, KernelTypesStreamKBf8); #include "test_gemm_streamk_atomic_cases.inc" #undef TEST_SUITE_NAME ``` Due to issues encountered with tile engine, we instead use pure gtests with a code generator that emits the repetitive .cpp files. The code generator parses `KernelTypesStreamK` type aliases from the types header using a two-phase approach: 1. At configure time, CMake runs the Python script with `--list_files` to extract the type alias names from the header (test_gemm_streamk_types.hpp) and compute the list of .cpp file paths that will be generated. This lets CMake know the exact set of source files for each target. 2. At build time, `add_custom_command` runs the script again with `--gen_files` to actually emit the .cpp files into the build directory, triggered only when the types header or generator script changes. With these changes, we've removed all Stream-K tile engine tests. 
There are now 5 targets for Stream-K GEMM tests: 1. test_ck_tile_streamk_atomic_smoke: smoke tests for Atomic reduction strategy (pipeline: compv3) 2. test_ck_tile_streamk_linear_smoke: smoke tests for Linear reduction strategy (pipeline: compv3) 3. test_ck_tile_streamk_tree_smoke: smoke tests for Tree reduction strategy (pipeline: compv3) 4. test_ck_tile_streamk_pipelines_smoke: smoke tests (smaller set) for pipelines other than compv3 - Since Stream-K can be thought of as a wrapper around universal GEMM, we don't need to extensively test each pipeline. So, we opt to run a few tests for different pipelines. Currently, this just consists of the mem pipeline, but compv4 is coming soon. 5. test_ck_tile_streamk_extended: extended tests ## Test Plan I have tested the gtests locally on gfx90a, gfx942, and gfx950. ## Test Result All local tests pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-02 21:07:13 +00:00
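The two-phase generator described in this entry (list the files at configure time, emit them at build time) might look roughly like the sketch below. It is an assumption-laden illustration rather than the actual script: the regular expression, the `using ... =` form of the aliases in the header, the output file naming, and the function names are all hypothetical. Only the emitted .cpp shape follows the example quoted in the commit message.

```python
import re

PREFIX = "KernelTypesStreamK"
# Assumes aliases are declared as `using KernelTypesStreamK<Suffix> = ...;`.
ALIAS_RE = re.compile(r"using\s+(" + PREFIX + r"\w+)\s*=")

TEMPLATE = '''#include "test_gemm_streamk_common_includes.hpp"

template <typename Tuple>
class {suite} : public TestCkTileStreamK<Tuple>
{{
}};

#define TEST_SUITE_NAME {suite}
TYPED_TEST_SUITE({suite}, {alias});
#include "test_gemm_streamk_{strategy}_cases.inc"
#undef TEST_SUITE_NAME
'''


def list_files(header_text, strategy):
    """Phase 1 (configure time): names of the .cpp files that will exist."""
    return [
        f"test_gemm_streamk_{strategy}_{alias[len(PREFIX):].lower()}.cpp"
        for alias in ALIAS_RE.findall(header_text)
    ]


def gen_files(header_text, strategy):
    """Phase 2 (build time): file name -> generated source text."""
    aliases = ALIAS_RE.findall(header_text)
    names = list_files(header_text, strategy)
    return {
        name: TEMPLATE.format(
            suite="TestCkTileStream" + alias[len("KernelTypesStream"):],
            alias=alias,
            strategy=strategy,
        )
        for name, alias in zip(names, aliases)
    }


# Hypothetical header contents, for illustration only.
header = """
using KernelTypesStreamKBf8  = ::testing::Types<int>;
using KernelTypesStreamKFp16 = ::testing::Types<int>;
"""
planned = list_files(header, "atomic")
generated = gen_files(header, "atomic")
```

Keeping both phases as functions of the same header text means the configure-time file list and the build-time output cannot drift apart.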
arai713	6d77edc3bd	[rocm-libraries] ROCm/rocm-libraries#5544 (commit 3be4095) [CK_TILE] Stream-K Tile Engine Fixes ## Motivation Stream-K GEMM in Tile Engine was unable to support instances where the matrix dimensions were not perfectly aligned due to bugs with padding support. This PR implements support for padding back into the Stream-K implementation in Tile Engine along with other minor fixes. Additionally, this PR introduces a benchmarking script that is standard for Tile Engine to run all compiled instances with user specified matrix dimensions. ## Technical Details - Fixed padding boolean comparison and parsing in gen_single so that padding flags from the config files are correctly propagated into the Stream-K template - Updated trait combo parsing to have the reduction_strategy for Stream-K in the correct order - Addition of gemm_streamk_benchmark.py to run all compiled Stream-K instances ## Test Plan Tested using the benchmark scripts to run instances. ## Test Result All instances passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-02 19:49:44 +00:00
Illia Silin	c73719a78f	[rocm-libraries] ROCm/rocm-libraries#6103 (commit c74e44d) Use ck_pytorch docker from private repo. ## Motivation Move the pytorch docker image used for CK testing into private repo. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-02 16:08:15 +00:00
Linjun-AMD	08792e0b31	[rocm-libraries] ROCm/rocm-libraries#5504 (commit 47f86c7) [CK Tile] Add sink token gradient support in FMHA backward pass (#5504) ## Motivation Adds sink token support to the FMHA backward kernel (dot_do_o pipeline): ## Technical Details - Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType - Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs - Compute per-head sink gradient via atomic accumulation in the pipeline - Update example runner with reference validation for sink gradient ## Test Plan Add new test case ## Test Result WIP ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-02 03:17:45 +00:00
Yaswanth Raparti	c1127a36f5	[rocm-libraries] ROCm/rocm-libraries#5676 (commit 1d18339) [CK][CK TILE] Autotuning heuristics infra for universal GEMM kernel selection (#5676) ## Motivation This PR adds ML-based kernel selection heuristics to the CK Tile dispatcher, enabling fast and accurate automatic kernel selection for Universal Gemm kernels. Instead of requiring exhaustive search through 4600+ kernel configurations (taking ~46 seconds per problem shape), the ML heuristic predicts optimal kernels in microseconds while achieving >98% of oracle-best performance. ## Technical Details ML infrastructure https://github.com/ROCm/rocm-libraries/tree/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics * Feature Engine ([feature_engine.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/feature_engine.py)): 55-feature extraction including problem dimensions, kernel configuration, tile efficiency, and hardware profile * Training Pipeline ([train.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/train.py)): LightGBM regression with log-transform, GroupKFold cross-validation, warm-start support * Predictor ([predict.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/predict.py)): Kernel ranking and TFLOPS prediction for problem shapes * Evaluation ([evaluate.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/evaluate.py)): Comprehensive metrics including efficiency, NDCG@k, shape family analysis Data Generation Tools: * 
[generate_benchmark_data.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/generate_benchmark_data.py): Build and benchmark kernels across diverse problem shapes * [convert_json_to_parquet.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/convert_json_to_parquet.py): Convert benchmark JSON to training-ready parquet format * [data_pipeline.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/data_pipeline.py): Parse streaming benchmark logs into canonical datasets Examples * [09_ml_heuristic.cpp](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/examples/gemm/cpp/09_ml_heuristic.cpp): C++ example demonstrating ML-based kernel selection * [09_ml_heuristic.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/examples/gemm/python/09_ml_heuristic.py): Python example with validation Pre-trained Models (projects/composablekernel/dispatcher/heuristics/models/): * gemm_universal_fp8_gfx950/: fp8 RCR model (42K trees, 97.51% mean efficiency) * gemm_universal_fp16_gfx950/: fp16 RCR model (20K trees, 99.36% mean efficiency) ## Test Plan * Evaluated on 25 diverse shapes for fp16, 168 shapes for fp8 * All shape families tested: tiny M (M<8), small M, medium M, large M (M≥1024) * All pipeline types: compv3, compv4, mem ## Test Result fp16 Model (gfx950, RCR layout) * Mean Efficiency: 99.36% * P10 Efficiency: 98.05% (90th percentile of shapes achieve ≥98% of oracle best) * Min Efficiency: 95.45% fp8 Model (gfx950, RCR layout) * Mean Efficiency: 98.28% (original), 97.51% (wide coverage) * P10 Efficiency: 94.64% (original), 93.89% (wide coverage) * Min Efficiency: 84.5% ## Submission 
Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-02 02:26:32 +00:00
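The efficiency figures quoted in the commit above (mean, P10, min) can be read as summary statistics over per-shape ratios of the selected kernel's throughput to the oracle-best throughput. The following is a minimal sketch of that reading, not the code in `evaluate.py`: the function name, the nearest-rank percentile choice, and the sample TFLOPS values are all assumptions made for illustration.

```python
def efficiency_summary(efficiencies):
    """Summarize per-shape efficiency values (selected TFLOPS / oracle-best TFLOPS)."""
    ordered = sorted(efficiencies)
    n = len(ordered)
    # Nearest-rank 10th percentile: roughly 90% of shapes sit at or above this value.
    rank = max(1, -(-n // 10))  # ceil(n / 10) using integer arithmetic
    return {
        "mean": sum(ordered) / n,
        "p10": ordered[rank - 1],
        "min": ordered[0],
    }


# Hypothetical benchmark data: TFLOPS of the kernel the heuristic picked,
# and TFLOPS of the best kernel found by exhaustive search, per problem shape.
selected = [95.0, 180.0, 49.0, 120.0, 60.0, 198.0, 75.0, 88.0, 150.0, 99.0]
oracle = [100.0, 200.0, 50.0, 120.0, 64.0, 200.0, 80.0, 88.0, 150.0, 100.0]

effs = [s / o for s, o in zip(selected, oracle)]
summary = efficiency_summary(effs)
print(summary)
```

Under this reading, a P10 of 98.05% means that about 90% of the evaluated shapes reach at least 98.05% of the oracle-best throughput.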
Jobbins	404a5ce1a4	[rocm-libraries] ROCm/rocm-libraries#6107 (commit e69d1b2) [CK] poll every 6 hours as workaround	2026-04-01 19:53:41 +00:00
Chinmay Dattanand Kuchinad	2bb69a24ea	[rocm-libraries] ROCm/rocm-libraries#5776 (commit ee1bbcb) [CK] Fix async pivot mismatch in persistent GEMM kernel scheduler (#5776) ## Motivation Fix pivot mismatch in the persistent GEMM kernel's async input scheduler that causes GPU hangs and incorrect results when used with AsyncTP (Asynchronous Tensor Parallelism) on ROCm. PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM kernel with chunk signals to overlap communication and computation. The pivot mechanism ensures each rank starts computing from its own local shard first (which is already available), then moves to remote chunks as they arrive over the network. Because of the pivot mismatch, the kernel frequently waits on signals for chunks that have not yet arrived, while attempting to read data from completely different chunks. This synchronization desync reliably triggers infinite hangs during multi-GPU native AsyncTP execution. This fix is required to enable functional AsyncTP support on ROCm. ## Technical Details In the persistent kernel loop (`UniversalGemmKernel::operator()`), the M-tile coordinate used for data selection (`i_m`) and the M-tile coordinate used for the chunk-signal wait (`chunk_idx`) were derived from inconsistent bases: * `i_m` was computed from the unpivoted tile index `iM`. * `chunk_idx` was computed from the pivoted expression `(iM + tile_idx_pivot)`. This means the kernel could wait for chunk N's signal but then read from chunk M's memory, or vice versa. The mismatch scales with GPU count: with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc. The Fix: Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m` and `chunk_idx` from it. This guarantees the kernel always waits for the correct chunk before reading its data. (Note: Minor cosmetic `clang-format` changes were also pulled in alongside the fix). ## Test Plan 1. Build PyTorch with this CK change. 2. 
Run the specific multi-GPU AsyncTP native test: `timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest test/distributed/test_symmetric_memory.py -k test_fused_all_gather_matmul_native -q -s -x` ## Test Result Tests verify correct overlapping execution without hangs or accuracy mismatches when running the AsyncTP native path with non-zero pivots. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 16:22:08 +00:00
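The pivot mismatch described in the commit above can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel code: it assumes one chunk per rank, a pivot equal to the first tile of the rank's local shard, and wrap-around by modulo; the function names are hypothetical. Only the idea of deriving both the data tile and the waited-on chunk from one pivoted index (`iM_eff`) comes from the commit message.

```python
def tile_coords(iM, pivot, num_m_tiles, tiles_per_chunk, fixed):
    """Return (chunk the data is read from, chunk whose signal is awaited)."""
    iM_eff = (iM + pivot) % num_m_tiles   # single pivoted M-tile index
    data_tile = iM_eff if fixed else iM   # before the fix: unpivoted index
    data_chunk = data_tile // tiles_per_chunk
    wait_chunk = iM_eff // tiles_per_chunk
    return data_chunk, wait_chunk


def mismatch_fraction(world_size, tiles_per_chunk, fixed):
    """Fraction of (rank, tile) pairs that wait on one chunk but read another."""
    num_m_tiles = world_size * tiles_per_chunk
    wrong = 0
    total = 0
    for rank in range(world_size):
        pivot = rank * tiles_per_chunk    # assumed: start at the local shard
        for iM in range(num_m_tiles):
            data_chunk, wait_chunk = tile_coords(
                iM, pivot, num_m_tiles, tiles_per_chunk, fixed
            )
            wrong += data_chunk != wait_chunk
            total += 1
    return wrong / total


for gpus in (2, 4):
    before = mismatch_fraction(gpus, 4, fixed=False)
    after = mismatch_fraction(gpus, 4, fixed=True)
    print(gpus, before, after)
```

In this toy model every rank with a non-zero pivot reads from the wrong chunk for all of its tiles, which averages to the proportions quoted in the commit message (about 50% with 2 GPUs, about 75% with 4). With the pivoted index used for both values, the mismatch disappears.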
Jobbins	9426f49b52	[rocm-libraries] ROCm/rocm-libraries#6064 (commit cce30ab) [CK] poll develop every 15 minutes for changes	2026-04-01 14:35:42 +00:00
Fu-Cheng Tsai	a502e5a00b	[rocm-libraries] ROCm/rocm-libraries#5798 (commit 7acd4e7) [CK_TILE] Update gfx12 FMHA forward kernel configs	2026-04-01 14:23:38 +00:00
aledudek	119712bd90	[rocm-libraries] ROCm/rocm-libraries#4469 (commit 0844cb0) [CK_TILE] Add pooling in tile_engine ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> Add pooling in ck tile engine ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 07:32:36 +00:00
Yi DING	791afc6465	[rocm-libraries] ROCm/rocm-libraries#5991 (commit 8d85e8e) [CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR misallocation (#5991) ## Motivation After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard around the `num_total_loop <= 0` early-exit check, the IGLP pipeline (`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV gradients for non-masking kernels (even with fix in #5915). Assembly inspection confirms that the CFG change causes the LLVM register allocator to reuse AGPR accumulators as scratch destinations in the dK/dV reduction loop, breaking the loop-carried accumulation across Q-tile iterations. ## Technical Details - Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in `BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing: it restores the CFG shape that the register allocator needs to correctly assign dedicated AGPRs to each column of the dK/dV accumulator. - Only the IGLP pipeline is affected; the other two BWD pipelines do not exhibit this issue. ## Test Plan ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 05:45:19 +00:00
Estevan Vedovelli	a33b5be1b9	[rocm-libraries] ROCm/rocm-libraries#6022 (commit 54b284a) [CK] contraction: extend GetTypeString() to include layout-differentiating params (#6022) ## Motivation Consumers that identify kernels by their `GetTypeString()` (such as hipTensor's actor-critic kernel selection, which hashes the string into a stable cross-platform UID) were silently dropping one of two colliding variants during registry insertion. `GetTypeString()` in `DeviceContractionMultipleD_Xdl_CShuffle` previously printed 13 template parameters, omitting `ABlockTransferSrcScalarPerVector`, `BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and `BBlockLdsExtraN`. These four parameters determine the block-transfer access width and LDS padding strategy, and are precisely what differentiates the `kk`, `kn`, `mk`, and `mn` layout variants from one another when all other geometry parameters are equal. Two instantiations with identical 13-parameter strings are distinct C++ types that accept different stride layouts and reject each other's arguments via `IsSupportedArgument`. This patch extends the output to 17 parameters so that every distinct template instantiation of this class produces a unique `GetTypeString()`. ## Technical Details `include/ck/tensor_operation/gpu/device/impl/device_contraction_multiple_d_xdl_cshuffle.hpp`: - extend `GetTypeString()` from 13 to 17 parameters including `ABlockTransferSrcScalarPerVector`, `BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and `BBlockLdsExtraN`. ## Test Plan Build CK and hipTensor with these changes, and verify hipTensor can differentiate and select the correct kernels with layout variations. ## Test Result CK is building correctly and hipTensor is selecting the kernels correctly. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 15:19:43 +00:00
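The collision described in the commit above can be sketched with a small registry keyed by a type string. This is an illustration only: the geometry field names, all parameter values, the string format, and the "first insertion wins" registry are assumptions. Only the four layout-differentiating parameter names are taken from the commit message.

```python
# Hypothetical stand-ins for the parameters that were already printed.
GEOMETRY_FIELDS = ["BlockSize", "MPerBlock", "NPerBlock"]
# The four parameters the commit adds to the string.
LAYOUT_FIELDS = [
    "ABlockTransferSrcScalarPerVector",
    "BBlockTransferSrcScalarPerVector",
    "ABlockLdsExtraM",
    "BBlockLdsExtraN",
]


def type_string(params, fields):
    """Build an identifier string from the chosen subset of parameters."""
    return "Contraction<" + ", ".join(str(params[f]) for f in fields) + ">"


def register(variants, fields):
    """Insert variants keyed by type string; a colliding key keeps the first entry."""
    registry = {}
    for name, params in variants.items():
        registry.setdefault(type_string(params, fields), name)
    return registry


geometry = {"BlockSize": 256, "MPerBlock": 128, "NPerBlock": 128}
variants = {
    # Hypothetical values: same geometry, different access width and LDS padding.
    "kk": {**geometry, "ABlockTransferSrcScalarPerVector": 4,
           "BBlockTransferSrcScalarPerVector": 4,
           "ABlockLdsExtraM": 1, "BBlockLdsExtraN": 1},
    "mn": {**geometry, "ABlockTransferSrcScalarPerVector": 1,
           "BBlockTransferSrcScalarPerVector": 1,
           "ABlockLdsExtraM": 0, "BBlockLdsExtraN": 0},
}

short_registry = register(variants, GEOMETRY_FIELDS)
full_registry = register(variants, GEOMETRY_FIELDS + LAYOUT_FIELDS)
print(len(short_registry), len(full_registry))
```

With the shorter string the second variant is silently dropped on insertion; with the layout parameters included, each variant gets its own key.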
Bartłomiej Kocot	ef4ff4667d	[rocm-libraries] ROCm/rocm-libraries#5842 (commit 04c5690) [CK][CK Tile] Force padding for atomic_add bf16 C tensor (#5842) ## Motivation Force padding for atomic_add bf16 C tensor to avoid memfaults. ## Technical Details - add global atomic add for bf16 and enable it - add padding for atomic add bf16 due to the lack of OOB checks - remove padding for non-contiguous dims in conv for other cases - minor bwd data conv fixes ## Test Plan test_grouped_conv_*_tile ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 08:03:41 +00:00
jakpiase	66dc81d530	[rocm-libraries] ROCm/rocm-libraries#5729 (commit 516c974) [CK_TILE] Changed cshuffle LDS descriptor to naive layout (#5729) ## Motivation This PR changes the gemm/convolution cshuffle layout to a plain one to improve cshuffle operation performance. ## Technical Details Before this change, the cshuffle layout used descriptor transformations that were probably aimed at reducing LDS bank conflicts, but the transformations themselves were terribly slow, which negatively impacted performance. ## Test Plan There is no need for additional tests, since current tests cover this functionality.	2026-03-31 03:40:25 +00:00
Illia Silin	e6b8094f94	[rocm-libraries] ROCm/rocm-libraries#5921 (commit 032ac1b) [CK] fix clang lifetimebound errors with staging compiler (#5921) ## Motivation The ROCm staging compiler (newer Clang) enforces `[[clang::lifetimebound]]` annotations on methods that return references or pointers to internal object data. Without these annotations, the staging compiler emits compilation errors for container accessor methods across the CK and CK Tile namespaces. ## Technical Details Adds `[[clang::lifetimebound]]` to all reference/pointer-returning accessors in core container types: `ck::` namespace: - `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()` - `index_array` -- `operator[]` - `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()` - `IndexLookupTable` -- `operator[]` `ck_tile::` namespace: - `array` -- `get(i)`, `at()`, `operator[]`, `operator()` - `static_array` -- `operator[]` - `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()` - `make_kernel()` -- parameter pack Also removes the unused `instance_index` variable from `batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing accordingly. ## Test Plan - Compile with the staging compiler to verify all lifetimebound errors are resolved - Existing tests pass unchanged -- the attribute is a compile-time annotation with no runtime effect ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 14:20:20 +00:00
Hosang Yoon	2dcae9d173	[rocm-libraries] ROCm/rocm-libraries#5977 (commit 794bea7) [CK_TILE] Fix Windows build in FMHA head grouping ## Motivation This is a follow-up fix for [PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018). [PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018) added LLC-aware FMHA head grouping / head-major scheduling on RDNA, but it also introduced Linux-only code paths, including `<dirent.h>`, which break Windows builds. This change fixes that by guarding the Linux-specific LLC probing logic so non-Linux platforms can still build correctly. ## Technical Details - Guard `<dirent.h>` with `#ifdef __linux__` - Guard KFD sysfs traversal logic with `#if defined(__linux__)` - On non-Linux platforms, return `0` from `get_kfd_sysfs_llc_cache_bytes()` - Preserve existing fallback behavior through: - `CK_TILE_FMHA_LLC_CACHE_MB` - arch-based default LLC sizes - no head grouping when no LLC size can be resolved ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 14:19:19 +00:00
Jeff Huang	7968368d92	[rocm-libraries] ROCm/rocm-libraries#5918 (commit a7e2c67) [CK][CK_TILE] Add fp8bf16 hdim=256 tile for batch prefill (#5918) ## Motivation FP8 batch prefill kernels currently only support head_dim=128. Models with head_dim=256 hit the "invalid argument for batch_prefill" error because no matching kernel variant exists in the codegen dispatch. ## Technical Details Add a hdim=256 tile size entry for fp8bf16 in the batch prefill codegen recipe (`fmha_batch_prefill.py`). Tile configuration: `FmhaFwdTileSize(128, 128, 32, 256, 32, 256, 4,1,1, 4,1,1, 32,32,32, 32,32,32, -1)` - bm0=128, bn0=128 (Q/K tile sizes) - bn1=256, bk0max=256 (V head_dim=256) - Warp MFMA 32x32x32 (fp8 MFMA instructions) This mirrors the existing bf16/fp16 hdim=256 tile but uses fp8 warp sizes. ## Test Plan Tested on both MI308X (gfx942) and MI355X (gfx950) via aiter batch prefill test with the following matrix: - page_size: {1, 16, 1024} - kv_layout: {linear, vectorized} - lookup_table: {sglang, vllm} - causal: {true, false} - logits_soft_cap: {0.0, 30.0} - contiguous_kv: {true, false} ## Test Result MI308X (gfx942): 160 passed, 32 skipped (page_size=1 + vectorized not applicable) MI355X (gfx950): 120 passed, 72 skipped (pre-existing ROCm 7.2 compiler issue with causal + no softcap) No register spills on either platform. 
### Profiling — MI355X (gfx950), FP8 pertensor, hdim=256, seqlen=1024, 8 heads \| page_sz \| kv_layout \| table \| causal \| soft_cap \| time_us \| TFLOPS \| \|---------\|-----------\|-------\|--------\|----------\|---------\|--------\| \| 1 \| linear \| sglang \| False \| 0.00 \| 55.01 \| 156.16 \| \| 1 \| linear \| vllm \| False \| 0.00 \| 55.12 \| 155.84 \| \| 1 \| linear \| sglang \| False \| 30.00 \| 62.63 \| 137.16 \| \| 1 \| linear \| vllm \| False \| 30.00 \| 62.16 \| 138.20 \| \| 1 \| linear \| sglang \| True \| 30.00 \| 64.09 \| 67.01 \| \| 1 \| linear \| vllm \| True \| 30.00 \| 63.85 \| 67.27 \| \| 16 \| linear \| sglang \| False \| 0.00 \| 57.00 \| 150.69 \| \| 16 \| vectorized \| sglang \| False \| 0.00 \| 57.55 \| 149.25 \| \| 16 \| linear \| vllm \| False \| 0.00 \| 56.80 \| 151.23 \| \| 16 \| vectorized \| vllm \| False \| 0.00 \| 57.32 \| 149.87 \| \| 16 \| linear \| sglang \| False \| 30.00 \| 64.77 \| 132.62 \| \| 16 \| vectorized \| vllm \| False \| 30.00 \| 63.54 \| 135.18 \| \| 16 \| linear \| sglang \| True \| 30.00 \| 66.84 \| 64.26 \| \| 16 \| vectorized \| vllm \| True \| 30.00 \| 66.12 \| 64.96 \| \| 1024 \| linear \| sglang \| False \| 0.00 \| 58.25 \| 147.46 \| \| 1024 \| vectorized \| sglang \| False \| 0.00 \| 57.53 \| 149.31 \| \| 1024 \| linear \| vllm \| False \| 0.00 \| 58.06 \| 147.94 \| \| 1024 \| vectorized \| vllm \| False \| 0.00 \| 57.55 \| 149.27 \| \| 1024 \| linear \| sglang \| False \| 30.00 \| 65.38 \| 131.38 \| \| 1024 \| vectorized \| vllm \| False \| 30.00 \| 63.64 \| 134.98 \| \| 1024 \| linear \| sglang \| True \| 30.00 \| 66.85 \| 64.25 \| \| 1024 \| vectorized \| vllm \| True \| 30.00 \| 65.26 \| 65.81 \| ### Profiling — MI308X (gfx942), FP8 pertensor, hdim=256, seqlen=1024, 8 heads \| page_sz \| kv_layout \| table \| causal \| soft_cap \| time_us \| TFLOPS \| \|---------\|-----------\|-------\|--------\|----------\|---------\|--------\| \| 1 \| linear \| sglang \| False \| 0.00 \| 110.18 \| 77.96 \| \| 1 \| 
linear \| vllm \| True \| 30.00 \| 134.33 \| 31.97 \| \| 1 \| linear \| sglang \| True \| 30.00 \| 134.59 \| 31.91 \| \| 16 \| linear \| sglang \| False \| 0.00 \| 115.43 \| 74.42 \| \| 16 \| vectorized \| sglang \| False \| 0.00 \| 106.11 \| 80.95 \| \| 16 \| linear \| vllm \| False \| 0.00 \| 116.34 \| 73.83 \| \| 16 \| vectorized \| vllm \| False \| 0.00 \| 106.17 \| 80.91 \| \| 16 \| linear \| sglang \| False \| 30.00 \| 135.61 \| 63.34 \| \| 16 \| vectorized \| vllm \| False \| 30.00 \| 122.37 \| 70.20 \| \| 16 \| linear \| sglang \| True \| 0.00 \| 117.44 \| 36.57 \| \| 16 \| vectorized \| vllm \| True \| 0.00 \| 108.81 \| 39.47 \| \| 16 \| linear \| sglang \| True \| 30.00 \| 139.43 \| 30.80 \| \| 16 \| vectorized \| vllm \| True \| 30.00 \| 125.87 \| 34.12 \| \| 1024 \| linear \| sglang \| False \| 0.00 \| 110.65 \| 77.63 \| \| 1024 \| vectorized \| sglang \| False \| 0.00 \| 101.70 \| 84.46 \| \| 1024 \| linear \| vllm \| False \| 0.00 \| 111.71 \| 76.89 \| \| 1024 \| vectorized \| vllm \| False \| 0.00 \| 101.55 \| 84.59 \| \| 1024 \| linear \| sglang \| False \| 30.00 \| 129.33 \| 66.42 \| \| 1024 \| vectorized \| vllm \| False \| 30.00 \| 120.95 \| 71.02 \| \| 1024 \| linear \| sglang \| True \| 0.00 \| 112.26 \| 38.26 \| \| 1024 \| vectorized \| vllm \| True \| 0.00 \| 103.02 \| 41.69 \| \| 1024 \| linear \| sglang \| True \| 30.00 \| 133.73 \| 32.12 \| \| 1024 \| vectorized \| vllm \| True \| 30.00 \| 124.75 \| 34.43 \| ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 10:21:33 +00:00
Yi DING	fb64a4453c	[rocm-libraries] ROCm/rocm-libraries#5915 (commit a72cf7d) [CK_TILE] Fix FMHA BWD register pressure by wrapping num_total_loop with amd_wave_read_first_lane (#5915) ## Motivation In three FMHA backward pipelines, `num_total_loop` is computed without `amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even though it is logically uniform across all lanes. This raises register pressure, and under high pressure the compiler may reuse VGPRs across overlapping live ranges. This was confirmed via assembly inspection: the compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an intermediate value for dV, producing incorrect dK/dV gradients. ## Technical Details Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three pipelines: - `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr` - `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp` - `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr` This promotes `num_total_loop` to an SGPR, eliminating the excess register pressure and the incorrect VGPR reuse. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 01:45:16 +00:00
Jan Patrick Lehr	b6bbada9f1	[rocm-libraries] ROCm/rocm-libraries#5639 (commit a65e645) [CK] More lifetime-warning suppression ## Motivation The staging compiler picked up another change from upstream that leads to more lifetime-analysis warnings. This breaks the build, given CK is built with -Werror. As a result, compiler promotion is blocked. ## Technical Details This patch adds the pragma push diagnostics to ignore the lifetime-warnings in the modified files to unblock compiler promotion. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-28 11:20:51 +00:00
Linjun-AMD	3b55a05e71	[rocm-libraries] ROCm/rocm-libraries#5849 (commit d9b89b2) [CK_TILE ]Revert "[CK_TILE] Enable MXFP6 for MX GEMM op (#5095)" (#5849) This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 20:37:23 +00:00
Bartłomiej Kocot	c28d0033d7	[rocm-libraries] ROCm/rocm-libraries#5785 (commit d8ecfc1) [CK] Fix min k_batch calculation in conv kernels ## Motivation Avoid division by 0 and remove the unneeded "-1". ## Technical Details Our div-up implementation already returns the exact quotient when the input is divisible, so there is no need to subtract 1. ## Test Plan test_grouped_conv_bwd_weight ## Test Result Passed locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1019	2026-03-27 15:38:21 +00:00
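The div-up behaviour mentioned in the commit above can be checked with a small sketch. The helper name `div_up` is hypothetical and the formula is the common ceiling-division idiom, assumed here rather than copied from CK; the actual min k_batch expression in the conv kernels is not shown in the commit message and is not reproduced.

```python
def div_up(a, b):
    """Ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


# Divisible input: the result is the exact quotient, not quotient + 1.
print(div_up(8, 4))
# Non-divisible input: the result is rounded up.
print(div_up(9, 4))
# Subtracting 1 from a div-up result can reach zero, which is unsafe as a divisor.
print(div_up(4, 4) - 1)
```

Because the divisible case already yields the exact quotient, an extra "-1" undercounts, and when the quotient is 1 it produces 0, which is one way a later division by zero could arise.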
Illia Silin	4c926497ad	[rocm-libraries] ROCm/rocm-libraries#5829 (commit 19b2813) [CK] Fix error in dockerfile when building staging compiler. (#5829) ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 15:37:21 +00:00
Johannes Graner	58475d3f45	[rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649) [CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393) ## Motivation Add StreamK work distribution to the CK Tile grouped convolution backward weight kernel. Split-K divides the K-dimension uniformly across a fixed `k_batch`, which causes load imbalance when the number of output tiles doesn't evenly fill the GPU. StreamK distributes total K-iterations evenly across workgroups, improving utilization on these shapes. ## Technical Details StreamK is added as an `if constexpr` branch in the existing kernel, selected by the `TilePartitioner_` template parameter. Two reduction strategies are supported: - Linear: tile-starter sequentially accumulates partials from contributing CTAs - Tree: pairwise binary tree reduction (O(log n) depth, faster for many contributors) Both persistent and non-persistent data-parallel (DP) sections are supported. Key changes: - `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution path with `RunStreamK`/`RunStreamKLoop`, partial store/load via workspace, flag-based cross-CTA synchronization, `GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions - `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers) and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by both GEMM and Conv StreamK kernels - `streamk_gemm_kernel.hpp`: Refactored to use shared helpers - Merged split-K and StreamK example invokers via `PartitionerPolicy` template parameter - StreamK example binary with `--streamk_reduction=linear\|tree` and `--streamk_persistent=0\|1` - CK Builder integration: `SpecifiesStreamK` concept, `TilePartitionerType` factory helper, `InstanceTraits` with StreamK fields - 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP), negative, builder regression ### Performance (MI355X, gfx950) Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}): \| Shape \| 16x64 tiles \| \| 
128x128 tiles \| \| \|---\|---\|---\|---\|---\| \| \| Split-K \| StreamK \| Split-K \| StreamK \| \| 1x1 128x128 N=32 28x28 \| 1.00x \| 0.54x \| 1.00x \| 0.81x \| \| 3x3 128x128 N=32 14x14 \| 1.00x \| 0.59x \| 1.00x \| 0.62x \| \| 1x1 256x64 N=32 56x56 \| 1.00x \| 0.83x \| 1.00x \| 1.83x \| \| 3x3 512x512 N=2 7x7 \| 1.00x \| 1.12x \| 1.00x \| 0.62x \| \| 1x1 1024x1024 N=4 7x7 \| 1.00x \| 1.09x \| 1.00x \| 0.60x \| \| 3x3 128x128 N=32 28x28 \| 1.00x \| 0.44x \| 1.00x \| 0.96x \| \| 3x3 256x256 N=32 14x14 \| 1.00x \| 0.67x \| 1.00x \| 0.93x \| \| 3x3 512x512 N=32 7x7 \| 1.00x \| 0.98x \| 1.00x \| 1.16x \| StreamK's value depends on tile config: with larger tiles (fewer output tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up to 1.16x on typical large-channel convolutions. Tree reduction consistently outperforms Linear when multiple CTAs contribute to the same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n) sequential accumulation. The table reports the best of Linear and Tree for each shape. ## Test Plan ```bash ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk ./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk # Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON) ninja -C build check-builder ``` 30 tests covering: - Host-side: type traits, kernel args construction, grid size, workspace size - GPU end-to-end (Linear + Tree): small/medium shapes, multi-group, stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher occupancy - Persistent DP: Linear + Tree with persistent data-parallel dispatch - Negative: `IsSupportedArgument` rejects unaligned K and C - Builder: Create (instance string validation) + Execution (reference comparison) + instance string regression ## Test Result All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK tests pass. Full `check-builder` suite passes. Tolerances computed dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware). 
## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 09:18:14 +00:00
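The StreamK ideas in the commit above (even distribution of K-iterations across workgroups, and Linear versus Tree reduction of partial results) can be sketched in a simplified host-side model. This is an assumption-laden illustration, not the kernel's scheduling code: it ignores the data-parallel section, workspace layout, and flag-based synchronization, and all function names are hypothetical.

```python
def streamk_partition(num_tiles, iters_per_tile, num_workgroups):
    """Split the flat range of K-iterations as evenly as possible across workgroups."""
    total = num_tiles * iters_per_tile
    base, extra = divmod(total, num_workgroups)
    ranges = []
    start = 0
    for wg in range(num_workgroups):
        count = base + (1 if wg < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def contributors_per_tile(ranges, num_tiles, iters_per_tile):
    """Count how many workgroups touch each output tile."""
    counts = []
    for tile in range(num_tiles):
        lo, hi = tile * iters_per_tile, (tile + 1) * iters_per_tile
        counts.append(sum(1 for s, e in ranges if s < hi and e > lo))
    return counts


def linear_reduce(partials):
    """Sequential accumulation: n - 1 dependent steps for n partials."""
    acc = partials[0]
    for p in partials[1:]:
        acc += p
    return acc, len(partials) - 1


def tree_reduce(partials):
    """Pairwise reduction: depth grows as ceil(log2(n))."""
    vals = list(partials)
    depth = 0
    while len(vals) > 1:
        vals = [sum(vals[i:i + 2]) for i in range(0, len(vals), 2)]
        depth += 1
    return vals[0], depth


ranges = streamk_partition(num_tiles=3, iters_per_tile=8, num_workgroups=4)
print(ranges, contributors_per_tile(ranges, 3, 8))
print(linear_reduce([1, 2, 3, 4, 5]), tree_reduce([1, 2, 3, 4, 5]))
```

Both reductions give the same sum; the difference is the number of dependent steps, which is the property the commit message credits for Tree outperforming Linear when many CTAs contribute to one tile.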
arai713	36f2ec23f5	[rocm-libraries] ROCm/rocm-libraries#5445 (commit 2cdbf8b) [CK_TILE] Support for CompV4 pipeline in Stream-K GEMM (#5445) ## Motivation This PR is extending the pipeline support for Stream-K GEMM by adding the CompV4 pipeline. Additional pipelines will be added in subsequent PRs. ## Technical Details - Enable the CompV4 pipeline by adding an option to set DoubleSMemBuffer to true if the CompV4 pipeline has been selected as it requires double buffered shared memory - Addition of CompV4 pipeline into the extended tests: kernel instances mirror the existing CompV3/Mem configurations (same layout permutations, data types, and tile sizes) with the pipeline type set to CompV4. - Addition of CompV4 pipeline into smoke tests (generated using Tile Engine) ## Test Plan These were tested using the existing smoke and extended tests. ## Test Result All tests passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 08:13:27 +00:00
Yi DING	47a04fda08	[rocm-libraries] ROCm/rocm-libraries#5790 (commit c132b5a) [CK_TILE] Fix NaN for FMHA BWD When seq_q=0 ## Motivation This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the effective query sequence length for a tile is zero, by ensuring the per-tile pipelines exit early with zeroed accumulators and by avoiding an early kernel return that prevented writing out cleared gradients. ## Technical Details - Add unconditional early-exit in the dK/dV pipelines when `num_total_loop <= 0` (no work), returning zeroed accumulators. - Adjust group-mode kernel early-return logic to only return when both `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and store cleared dK/dV when `seqlen_q == 0`. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 07:54:53 +00:00
Yaswanth Raparti	e2470e837a	[rocm-libraries] ROCm/rocm-libraries#5880 (commit a6b6c05) [CK][CK_TILE] Fix CTest parsing to handle all test number formats (#5880) ## Motivation Fix a bug in the smart-build --ctest-only filter that was incorrectly excluding tests with numbers less than 100. ## Technical Details The issue was caused by CTest formatting test numbers with variable spacing based on the number of digits: - "Test   #1: name" (3 spaces for tests 1-9) - "Test  #79: name" (2 spaces for tests 10-99) - "Test #100: name" (1 space for tests 100+) The previous code used `line.strip().startswith("Test #")` which only matched tests with a single space (i.e., test numbers >= 100). This caused tests like ck_tile_unit_sequence (Test #79) to be excluded from smart-build test selection, resulting in CTest failures when the binary wasn't built. Solution: Replace string matching with a regex pattern that handles all spacing variations: r'^\sTest\s+#\d+:\s(.+)$' ## Test Plan Tested with test numbers from 1 to 12345. ## Test Result - Before: 48 tests selected (only tests #100+) - After: 146 tests selected (all CTest-registered tests) ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-03-27 06:34:12 +00:00
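The parsing issue in the commit above can be reproduced with a short sketch. The sample listing and function names are hypothetical. The pattern below uses `\s*` for the leading and post-colon whitespace, which is an assumption: the pattern as shown in the commit message has no `*` quantifiers, possibly because the page rendering dropped them.

```python
import re

# Assumed pattern: optional leading whitespace, "Test", one or more spaces,
# "#<number>:", optional whitespace, then the test name.
TEST_LINE = re.compile(r'^\s*Test\s+#\d+:\s*(.+)$')


def parse_with_startswith(listing):
    """Old approach from the commit message: only matches 'Test #' with one space."""
    return [
        line.split(":", 1)[1].strip()
        for line in listing.splitlines()
        if line.strip().startswith("Test #")
    ]


def parse_with_regex(listing):
    """Regex approach: tolerates any amount of padding before the '#'."""
    names = []
    for line in listing.splitlines():
        match = TEST_LINE.match(line)
        if match:
            names.append(match.group(1).strip())
    return names


# Hypothetical listing in the style of `ctest -N` output.
sample = (
    "Test project /build\n"
    "  Test   #1: test_alpha\n"
    "  Test  #79: ck_tile_unit_sequence\n"
    "  Test #100: test_gamma\n"
    "\n"
    "Total Tests: 3\n"
)

old_names = parse_with_startswith(sample)
new_names = parse_with_regex(sample)
print(old_names, new_names)
```

On this sample the prefix check keeps only the three-digit test, while the regex keeps all three and still skips the header and summary lines.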
Illia Silin	2f98c7bbef	[rocm-libraries] ROCm/rocm-libraries#5891 (commit 82563ff) fix AITER docker setup ## Motivation Add a new python package required to build AITER. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 04:36:16 +00:00
Bartłomiej Kocot	1c95ce0668	[rocm-libraries] ROCm/rocm-libraries#5856 (commit 2d9a0a1) [CK] Fix unused param mask ## Motivation Compiler error caused by unused param mask. ## Technical Details Skip tests using param mask in test loop. ## Test Plan Current test improvements. ## Test Result Passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 03:58:37 +00:00
dependabot[bot]	6215bb8dbc	[rocm-libraries] ROCm/rocm-libraries#5896 (commit b7436b5) Bump requests from 2.32.5 to 2.33.0 in /projects/composablekernel/docs/sphinx (#5896) Bumps [requests](https://github.com/psf/requests) from 2.32.5 to 2.33.0. ## Release notes (sourced from https://github.com/psf/requests/releases) v2.33.0 (2026-03-25) Announcements: - Requests is adding inline types. If you have a typed code base that uses Requests, please take a look at #7271 (https://redirect.github.com/psf/requests/issues/7271). Give it a try, and report any gaps or feedback you may have in the issue. Security: - CVE-2026-25645 `requests.utils.extract_zipped_paths` now extracts contents to a non-deterministic location to prevent malicious file replacement. This does not affect default usage of Requests, only applications calling the utility function directly. Improvements: - Migrated to a PEP 517 build system using setuptools. (#7012) Bugfixes: - Fixed an issue where an empty netrc entry could cause malformed authentication to be applied to Requests on Python 3.11+. (#7205) Deprecations: - Dropped support for Python 3.9 following its end of support. (#7196) Documentation: - Various typo fixes and doc improvements. New Contributors: - @M0d3v1 made their first contribution in psf/requests#6865 - @aminvakil made their first contribution in psf/requests#7220 - @E8Price made their first contribution in psf/requests#6960 - @mitre88 made their first contribution in psf/requests#7244 - @magsen made their first contribution in psf/requests#6553 - @Rohan5commit made their first contribution in psf/requests#7227 Full Changelog: https://github.com/psf/requests/blob/main/HISTORY.md#2330-2026-03-25 ## Commits - `bc04dfd` v2.33.0 - `66d21cb` Merge commit from fork - `8b9bc8f` Move badges to top of README (#7293) - `e331a28` Remove unused extraction call (#7292) - `753fd08` docs: fix FAQ grammar in httplib2 example - `774a0b8` docs(socks): same block as other sections - `9c72a41` Bump github/codeql-action from 4.33.0 to 4.34.1 - `ebf7190` Bump github/codeql-action from 4.32.0 to 4.33.0 - `0e4ae38` docs: exclude Response.is_permanent_redirect from API docs (#7244) - `d568f47` docs: clarify Quickstart POST example (#6960) - Additional commits viewable in the compare view: https://github.com/psf/requests/compare/v2.32.5...v2.33.0	2026-03-26 22:01:37 +00:00
joyeamd	046d3ac274	[rocm-libraries] ROCm/rocm-libraries#5789 (commit 6654ca6) [CK][CK_TILE] Revert additional oob check in gemm IsSupported function (#5789) ## Motivation Fix ck_tile's oob check. ## Technical Details ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-26 01:41:35 +00:00
Estevan Vedovelli	0004a37de5	[rocm-libraries] ROCm/rocm-libraries#5675 (commit fbd7fa7) [CK] Properly build HIPTENSOR_REQ_LIBS_ONLY targets when used in addition to MIOPEN_REQ_LIBS_ONLY (#5675) ## Motivation When building CK with both -DHIPTENSOR_REQ_LIBS_ONLY=ON and -DMIOPEN_REQ_LIBS_ONLY=ON, only MIOpen targets were being properly installed. This change is necessary to allow hipTensor to build with TheRock without the need to rebuild CK from source. ## Technical Details The solution consists of considering both HIPTENSOR_REQ_LIBS_ONLY and MIOPEN_REQ_LIBS_ONLY when including hiptensor's targets in CMake, following the same approach used for the conv target (for MIOpen). ## Test Plan Manually test the build and installation with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` alone and with both `-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON`, and verify that the proper files are installed. ## Test Result The build with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` properly includes the targets contraction, reduction and other, while `-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON` includes conv, contraction, reduction and other. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-25 23:59:53 +00:00
Illia Silin	86ec92f925	[rocm-libraries] ROCm/rocm-libraries#5571 (commit 8f60932) [CK] fix clang lifetime bound error in ck_builder. ## Motivation This resolves the compilation error with the latest develop compiler branch. ## Technical Details ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-25 16:45:38 +00:00
Illia Silin	bee61860c2	[rocm-libraries] ROCm/rocm-libraries#5764 (commit f3c1232) Re-enable daily builds with staging compiler ## Motivation This should help us catch and fix any new compilation issues early on. ## Technical Details We now have three compiler profiles: * develop: slightly stabilized version of amd-staging with some of the obvious offending PRs reverted, 1-2 weeks behind amd-staging; * amd-mainline: more stable version of compiler, the baseline for all other branches, e.g., release, npi, etc., 2-4 weeks behind amd-staging; * amd-staging: latest compiler version where all new PRs land, often broken. ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: kensclin <lshyhchy@amd.com>	2026-03-25 16:37:58 +00:00
Ville Pietilä	ec2dbfbfde	[rocm-libraries] ROCm/rocm-libraries#5516 (commit ff3afda) [CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler (#5516) ## Motivation We want to close the performance gap between old CK and CK Tile for bwd data convolutions. To achieve this, we need two things: - Configurations for the old CK kernel instances such that we can map them into CK Tile instances. - Support in the CK profiler to run the CK Tile instance with the same API as for old CK instances. ## Technical Details Extracted kernel configurations from old CK. The codegen Python script for CK Tile convs is extended to also support bwd data. The generated instances are added to the CMake build (target `device_grouped_conv_bwd_data_tile_instances`). A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the CK Profiler. The API is the same as for old CK's profiler op `grouped_conv_bwd_data`.	2026-03-25 14:36:11 +00:00
joyeamd	1834e318da	[rocm-libraries] ROCm/rocm-libraries#5697 (commit dd1c396) Revert "Ck/joye/revert oob check (#5640)" This reverts commit 552ab4880292694cb8261f40fa4223af52cb8419. ## Motivation ## Technical Details ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-23 22:05:30 +00:00
andrew clark	5a4243096b	[rocm-libraries] ROCm/rocm-libraries#5713 (commit e179279) Adding New Notification Detection ## Motivation Restricting one of the notification failure patterns to match a specific missing drivers log pattern. This will help reduce the noise of erroneous logs. Also adding a new failure pattern to notify us of Github access issues. ## Technical Details - Set the failure pattern to match the exact failure observed in the logs. - Switching to a plain substring search so special characters are handled literally. - Added a new failure pattern for Github access errors. ## Test Plan - Force a failure using the known failure patterns. ## Test Result The forced failures were triggered and caught by the notification system. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-23 20:57:55 +00:00
Eiden Yoshida	ba2fb0224f	[rocm-libraries] ROCm/rocm-libraries#5691 (commit 2fbb1fc) [CK] MICI: Revert "add self healing to ref repo" The check may not be working as intended, causing premature deletion of reference repositories	2026-03-23 14:16:53 +00:00

3217 Commits