composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-20 12:59:49 +00:00

Author	SHA1	Message	Date
Gino Lu	fb75da2467	sparse_attn: wire -mask and -attention_sink (block-map prune + attn mask)	2026-05-19 23:22:00 -04:00
Gino Lu	b3ea819ff7	sparse_attn: annotate PV-skip chart with speedup vs dense baseline	2026-05-19 22:42:49 -04:00
Gino Lu	9e3f8838de	sparse_attn: drop stale FMHA-vs-sparge perf section from README	2026-05-19 22:37:55 -04:00
Gino Lu	d939c3b4fc	sparse_attn: split-launch dispatch + 3-mode PV-skip - Per-head pv_threshold via head_remap LUT (CLI: -pv_threshold_per_head); sentinel 1e30 routes to kEnablePVSkip=false bucket - kEnablePVSkip bool → PVSkipMode enum {kNone, kPerWarp, kPerBlock}; new kPerBlock matches upstream sm80 (LDS vote, V loads unconditional). CLI: -pv_mode={none,warp,block}, default warp - README: PV-skip modes section + MI300X 3-curve sparsity chart Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-05-19 21:45:23 -04:00
Gino Lu	304c1f9244	Merge remote-tracking branch 'origin/develop' into ginolu/sparge_attention	2026-05-19 21:34:32 -04:00
Illia Silin	37950ea4eb	[rocm-libraries] ROCm/rocm-libraries#7547 (commit 7e032ad) [CK] fix daily builds for pytorch ## Motivation This will restore the daily builds that test whether the latest pytorch code can build with the latest CK code (pulled from the standalone CK repo). ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-19 14:14:32 +00:00
Aaryaman Vasishta	457f153b69	[rocm-libraries] ROCm/rocm-libraries#7016 (commit 2b73c00) [CK] Fix RDNA3 FMHA tile-load paths ## Summary Fix CK tile FMHA paths needed for RDNA3/RDNA4 targets. ## Details This PR addresses RDNA-specific issues hit while enabling xFormers CK FMHA on gfx11/gfx12: - On RDNA3, update FMHA P tile handling so the layout consumed by the second GEMM matches the WMMA path. ## Testing Validated downstream with xFormers CK/FMHA on gfx1201/gfx1151. ```text pytest --import-mode=importlib -q \ tests/test_mem_eff_attention.py::test_forward \ tests/test_mem_eff_attention.py::test_backward \ tests/test_mem_eff_attention.py::test_dropout_ck 3844 passed, 5244 skipped, 26 warnings ```	2026-05-19 13:42:43 +00:00
Po Yen Chen	424dfec6e4	[rocm-libraries] ROCm/rocm-libraries#7530 (commit 378e049) [CK] Fix FMHA sink dispatch when init_sink_value is set (#7530) ## Summary - Fix `traits.has_sink` in `fmha_fwd_runner.hpp` to also check `init_sink_value != 0`, so the GPU kernel dispatches with sink support when `-init_sink=1` is passed. - Gate `run_sink_mask_tests` (StreamLLM) and `run_sink_init_tests` (GPT-OSS) behind opt-in flags `-m` and `-g` in `smoke_test_fwd.sh`. These tests require sink=true kernel instances which are excluded by the `BUILD_TESTING` CMake filter (`_nsink`), causing unconditional "not supported yet" failures (48 tests in CI). The opt-in flag approach was borrowed from PR #6057. ## Why gate tests instead of compiling sink=true kernels? The `BUILD_TESTING` filter in `CMakeLists.txt` uses `_nsink` glob patterns for the `fwd` and `fwd_splitkv` APIs, excluding sink=true kernel instances from compilation. We chose opt-in flags over widening the filter because: - Compile time: Enabling sink=true kernels doubles the kernel variants for `fwd` and `fwd_splitkv` APIs. The filter exists specifically to reduce CI build times. - Incremental enablement: Sink support (StreamLLM / GPT-OSS) is still maturing. Gating lets teams opt in explicitly (`smoke_test_fwd.sh -g`) while keeping the default CI path fast. - Precedent: splitkv (`-s`) and appendkv (`-a`) tests already follow this opt-in pattern. ## Test plan - [ ] Run `smoke_test_fwd.sh -g` with sink=true kernels compiled and verify sink-enabled kernels are dispatched - [ ] Verify `smoke_test_fwd.sh` still passes without `-m` / `-g` flags - [ ] Confirm CI no longer fails on sink tests (they are now opt-in)	2026-05-18 16:10:30 +00:00
Gino Lu	0f8b58ac88	sparse_attn: R25 Step 1 A1 — per-warp PV-skip (paper Algorithm 1) + V0 instantiation Preserve the R25 Step 1 "A1 / redesign D" state before redesigning toward "B" (per-CTA PV-skip matching upstream shipped reference). This snapshot lets us restore A1 if the B redesign fails. A1 redesign D pipeline (per-warp, arithmetic-only PV-skip, wrapped in `if constexpr (kEnablePVSkip)`): - include/ck_tile/ops/sparse_attn/pipeline/block_fmha_pipeline_qr_ks_vs_async_sparge.hpp - include/ck_tile/ops/sparse_attn/kernel/fmha_fwd_sparge_kernel.hpp V0 instantiation wiring (per gino_tmp/R25/programmer/v0_instance/REPORT.md): - example/ck_tile/50_sparse_attn/codegen/ops/fmha_fwd_sparge.py - example/ck_tile/50_sparse_attn/fmha_fwd_trek.hpp - example/ck_tile/50_sparse_attn/sparge_blockmap_trek.hpp - example/ck_tile/50_sparse_attn/sparge_blockmap_inst.cpp - example/ck_tile/50_sparse_attn/codegen/cpp_symbol_map.py - example/ck_tile/50_sparse_attn/CMakeLists.txt - example/ck_tile/01_fmha/CMakeLists.txt - example/ck_tile/50_sparse_attn/test_sparge.cpp (-pv_skip_compile=0\|1 CLI) This commit excludes all *_REVIEW.{hpp,cpp} mirror files (left untracked) and all build artefacts. _vsa.hpp / _jenga.hpp are not modified. Tag: R25-step1-A1-paper-aligned points at this commit.	2026-05-18 06:13:38 -04:00
Vidyasagar Ananthan	86591de476	[rocm-libraries] ROCm/rocm-libraries#5260 (commit a1834d2) [CK] [CK_Tile] Add FMHA scaffolding to CK kernel dispatcher (#5260) ## Motivation The CK Tile dispatcher currently supports GEMM and Grouped Convolution but has no support for Fused Multi-Head Attention (FMHA). The example/ck_tile/01_fmha folder contains a comprehensive FMHA implementation with forward, backward, split-KV, paged-KV, append-KV, and batch-prefill kernels across multiple GPU architectures, but there is no unified dispatch layer for it. This PR ports the FMHA stack into the dispatcher, following the same architectural patterns established by GEMM and Grouped Convolution, enabling runtime kernel selection, JIT compilation from Python, and a declarative C++ example flow. Autotuning heuristics to follow. ## Technical Details This PR adds FMHA scaffolding to the CK dispatcher framework, mirroring GEMM's layered architecture. Seven new C++ runtime headers provide type definitions (coexisting with upstream headers via __has_include, requiring zero modifications to example/ck_tile/01_fmha/), a problem builder with 18+ setters, Signature + Algorithm kernel key matching, a virtual kernel instance, a DECL_FMHA_KERNEL_SET macro with wildcard support and named tile/wave/warp setters, arch-aware registry with JSON export, and a dispatcher with seqtune-aware selection, configurable timing, and multi-stage execution plans for split-KV (two-stage) and backward (three-stage). The codegen pipeline is driven by a fmha_arch_specs.json capturing per-arch tile tables and pipeline constraints for five architectures (gfx90a/942/950/1100/1201), migrated from hardcoded logic in 01_fmha/codegen/, with supporting modules for C++ symbol mappings, validation rules, and named receipt profiles (ck_default, flash, pytorch, aiter, fp32, fp8). 
Python integration (fmha_utils.py) mirrors the C++ layer with JIT compilation, parallel multi-kernel builds, HIP memory management via ctypes, tolerance-based validation, and a NumPy CPU reference with GQA support. Twenty-seven C++ and thirty-two Python examples cover the full feature surface — forward, split-KV, masks, bias, dropout, GQA, backward, append-KV, batch prefill, fp8, logits soft cap, sink tokens, and parameter sweeps — all JIT-compiled on the fly. ## Test Plan Seven test files cover the runtime types, codegen, and end-to-end correctness. C++ unit tests validate the problem builder, dispatcher planning (single-stage for forward/paged-KV/append-KV; multi-stage for split-KV and backward), registry operations, and the kernel-set declaration macro. Python unit tests verify codegen emission, profile filtering, and 15 validation rules for masks, hdim constraints, and pipeline requirements. GPU execution validation in 01_basic_fmha --validate reports zero errors across 65,536 elements with max absolute error of 7.29e-05. A gold-standard parity suite (test_fmha_parity.py) runs 14 configurations through both the upstream tile_example_fmha_fwd and the dispatcher, comparing exit codes to confirm behavioral parity — all 14 match. ## Test Result The C++ smoke test builds and passes all 9 compiled examples, and a Python JIT sweep (29_sweep_seqlen.py) passes 7/7 configurations reaching up to 375 TFLOPS at seqlen 2048. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-17 07:30:33 +00:00
Gino Lu	840b8a37d9	test(sparse_attn): CPU-ref cross-check + BLKQ cite Wire SpargeAttn CPU reference into test_sparge: build the block_map on host via sparge::build_block_map_meansim and cross-check against the GPU-produced map; self-check the VSA delta-LUT (valid count + reachable kb indices); split PASS/FAIL into separate block_map / LUT / attention-output lines for clearer diagnosis. Set sparge_tool::SpargeParams::BLKQ default to 64 to match SpargeAttn SM90 convention (cite upstream qk_int_sv_f8_cuda_sm90.cu:143-144); tighten bf16 tolerance back to the dense FMHA baseline (4e-2 atol, 1e-2 rtol). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-05-17 02:35:51 -04:00
Gino Lu	879d50836e	cleanup(sparse_attn): R-tag rename + clang-format sweep Strip internal R-tag / phase labels (R20, R21A/B, Round 8/13f, Track F, B2.v3, Phase 1/2/3) from comments — replace with descriptive names so future readers don't need the change-log. Reflow long signature in fmha_fwd_trek.hpp. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-05-17 02:35:07 -04:00
Gino Lu	7103eacc99	refactor(sparse_attn): caller-owned workspace + dtype-aware sizing Replace process-lifetime lazy hipMalloc K-stats workspace with a caller-owned buffer; expose sparge_blockmap_get_workspace_size() / compute_workspace_layout() host helpers. Split the combined sparge_blockmap_fwd into stage launchers (sparge_kstats_fwd_oneshot + sparge_blockmap_only_fwd_oneshot) so the chained launch is timed end-to-end. Make pooled_k storage dtype follow KDataType (fp16/bf16) instead of fp32 to halve workspace footprint and match dense-FMHA precision. Tighten per-head superparam pointers to required (non-null) and assert N_k <= 256 in jenga MakeKargs to document the 256-bool LDS staging cap. Drop the obsolete VSA extra-LDS staging. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-05-17 02:34:23 -04:00
Gino Lu	668e107282	fix(sparse_attn): backport PR #4742 LDS s_barrier Add s_barrier after sched_barrier when K-tail and V share LDS buffer, mirroring upstream PR #4742. Applies to both async_vsa and async_jenga pipelines. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-05-17 02:30:48 -04:00
Jobbins	61b019f2a2	[rocm-libraries] ROCm/rocm-libraries#6961 (commit 47e8768) [CK] print hostname and $NODE_NAME to find inconsistencies (#6961) ## Motivation We suspect that the check for amdgpu, `cat /sys/module/amdgpu/version`, sometimes runs on the Jenkins controller instead of the node. This adds the `hostname` command so its output can be compared with the $NODE_NAME variable. ## Technical Details Updated Jenkinsfile to include the `hostname` command. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-15 21:50:09 +00:00
John Shumway	3e110e1718	[rocm-libraries] ROCm/rocm-libraries#7114 (commit ecef372) [CK] Add rocm_ck foundation types: DataType, Layout, Args, Ops (#7114) ## Summary - Add the vocabulary types that all rocm_ck schema headers build on - 9 new headers under `include/rocm_ck/`, 6 unit test files - Pure C++20, host-only, no CK Tile dependencies Headers:
| Header | Purpose |
|--------|---------|
| `index_t.hpp` | `index_t`, `long_index_t` (matches ck_tile) |
| `gpu_target.hpp` | `GpuTarget` enum (ISA targets) |
| `datatype.hpp` | `DataType` enum (17 variants) |
| `layout.hpp` | `Layout` enum (Row, Col, Auto) + stride helpers |
| `fixed_string.hpp` | `FixedString<N>`: structural string for NTTPs |
| `args.hpp` | Generic kernel argument buffer (ABI) |
| `ops.hpp` | Operator structs (`GemmOp`, `AddOp`, ...) + `Op` variant |
| `physical_tensor.hpp` | `PhysicalTensor`: maps names to Args slots |
| `resolved_tensor.hpp` | `ResolvedTensor`: output of `Signature::resolve()` |
Stack: This is PR 1 of 3 porting the rocm_ck constexpr schema from experimental to production, #7143. 1. This PR: Foundation types (vocabulary) 2. Schema engine: `Signature`, `resolve()`, `ArchProperties` 3. Spec factories: `GemmSpec`, `ElementwiseSpec`, `makeSpec()` ## Test plan - [ ] `ninja build-smoke-rocm-ck` builds all tests - [ ] `ctest -L ROCM_CK_SMOKE --output-on-failure`: 6 unit tests pass (86 test cases) - [ ] Default CK build (`CK_ENABLE_ROCM_CK=OFF`) unaffected 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-05-15 19:22:44 +00:00
Illia Silin	187ef8ac94	[rocm-libraries] ROCm/rocm-libraries#7471 (commit 13b9eec) [CK] increase timeout limit for fmha_fwd tests to avoid CI failure on gfx11 (#7471) ## Motivation This should prevent fmha_fwd tests from timing out on one of the slower gfx11 CI nodes and generating false CI failures. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-15 16:42:53 +00:00
peterjunpark	fcece1e838	[rocm-libraries] ROCm/rocm-libraries#7380 (commit 50b369d) docs(composablekernel): update install pages for 7.13 ## Motivation Update instructions to install prebuilt packages. Generally, point to ROCm Core SDK install instructions. List a more granular package if useful to users. Preview: https://rocm.docs.amd.com/projects/composable_kernel/en/users-peterjunpark-ck-7.13-install/install/Composable-Kernel-install.html Note: Some links appear as plain text in the preview because the target page isn't publicly accessible yet. Related to: - [x] composablekernel https://github.com/ROCm/rocm-libraries/pull/7380 - [x] hipblas https://github.com/ROCm/rocm-libraries/pull/7378 - [x] hipblaslt https://github.com/ROCm/rocm-libraries/pull/7379 - [x] hipcub https://github.com/ROCm/rocm-libraries/pull/7377 - [x] hipdnn https://github.com/ROCm/rocm-libraries/pull/7376 - [x] hipfft https://github.com/ROCm/rocm-libraries/pull/7375 - [x] hiprand https://github.com/ROCm/rocm-libraries/pull/7374 - [x] hipsolver https://github.com/ROCm/rocm-libraries/pull/7371 - [x] hipsparse https://github.com/ROCm/rocm-libraries/pull/7373 - [x] hipsparselt https://github.com/ROCm/rocm-libraries/pull/7372 - [x] miopen https://github.com/ROCm/rocm-libraries/pull/7370 - [x] rocblas https://github.com/ROCm/rocm-libraries/pull/7369 - [x] rocfft https://github.com/ROCm/rocm-libraries/pull/7368 - [x] rocprim https://github.com/ROCm/rocm-libraries/pull/7367 - [x] rocrand https://github.com/ROCm/rocm-libraries/pull/7366 - [x] rocsolver https://github.com/ROCm/rocm-libraries/pull/7364 - [x] rocsparse https://github.com/ROCm/rocm-libraries/pull/7365 - [x] rocthrust https://github.com/ROCm/rocm-libraries/pull/7363 - [x] rocwmma https://github.com/ROCm/rocm-libraries/pull/7362 ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-15 14:58:51 +00:00
Bartłomiej Kocot	945849b0f5	[rocm-libraries] ROCm/rocm-libraries#6838 (commit ff7a665) [CK_TILE] Add depthwise conv2d forward kernel (FP16/FP32) (#6838) ## Motivation CK currently has no kernel optimized for depthwise convolution (G=C_in=C_out, C=K=1 per group) and existing generic paths perform poorly for this workload. This PR adds a dedicated depthwise conv forward kernel in CK Tile. ## Technical Details Adds a dedicated depthwise conv2d forward op to CK Tile that performs direct convolution rather than falling back to the generic GEMM path. The kernel is templatized by filter size, stride, and data type, and compiled into ~60 instances covering common configurations (kernel 3/5/7/9, stride 1/2, FP16/FP32). Supports both CDNA (gfx942/gfx950) and RDNA (gfx1100/gfx1200) architectures. ## Test Plan - [x] Correctness and performance validated on gfx942, gfx950, and gfx1100, with ckProfiler `grouped_conv_fwd` as baseline. - [ ] MI300A (gfx942) and gfx1200 validation. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1137	2026-05-15 13:48:51 +00:00
Yaswanth Raparti	fe2e29fa68	[rocm-libraries] ROCm/rocm-libraries#7289 (commit e3fb4ee) [CK] Fix smart build false positives from merged commits (#7289) ## Motivation The current smart-build infrastructure triggers a full build for almost every PR, which is draining our CI infrastructure. The test selection logic needs to use diffs from the current workspace instead of the entire repo. ## Technical Details Use three-dot syntax and scope BUILD_INFRA_PATTERN to composablekernel. Changes: - Switch from two-dot (..) to three-dot (...) in git diff - Three-dot shows only PR-specific changes - Excludes commits merged from develop (prevents false positives) - Scope BUILD_INFRA_PATTERN to projects/composablekernel/ paths only - Avoids triggering on other projects (hipblas, hipdnn, etc.) - Only composablekernel build infra changes trigger full build - Update both ci_safety_check.sh and validate_pr.sh ## Test Plan Test with PR 7112 and 7223 ## Test Result Impact: - PR 7112: Was 620 files (false positive) → Now 6 files (correct) - PR 7223: Was full build (false positive) → Now selective build (correct) ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 19:32:32 +00:00
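Note on fe2e29fa68: the two-dot versus three-dot distinction can be illustrated with a toy model. This is a minimal sketch, not the logic of `ci_safety_check.sh` or `validate_pr.sh`; the file names and contents are invented, and each snapshot is just a dictionary standing in for a git tree. It assumes the standard `git diff` semantics: `A..B` compares the two tips, while `A...B` compares the merge base of A and B with B.

```python
# Toy model of a repository: each snapshot maps file path -> content.
# All file names and contents below are made up for illustration.

def changed_files(old, new):
    """Return the sorted paths whose content differs between two snapshots."""
    paths = set(old) | set(new)
    return sorted(p for p in paths if old.get(p) != new.get(p))

# Snapshot at the merge base (where the PR branch forked from develop).
merge_base = {"ck/gemm.hpp": "v1", "ck/conv.hpp": "v1", "hipblas/api.hpp": "v1"}

# develop moved on after the fork: unrelated files changed there.
develop_tip = dict(merge_base, **{"ck/conv.hpp": "v2", "hipblas/api.hpp": "v2"})

# The PR branch only touched one file.
pr_tip = dict(merge_base, **{"ck/gemm.hpp": "v2"})

# `git diff develop..pr` compares the two tips directly, so files that
# changed only on develop also show up.
two_dot = changed_files(develop_tip, pr_tip)

# `git diff develop...pr` compares the merge base with the PR tip, so only
# the PR's own changes show up.
three_dot = changed_files(merge_base, pr_tip)

print("two-dot  :", two_dot)
print("three-dot:", three_dot)
```

In this model the two-dot comparison reports every file that differs between the tips, including files the PR never touched, which is the kind of false positive the commit describes.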
John Shumway	6cd06382b3	[rocm-libraries] ROCm/rocm-libraries#7090 (commit 316fded) [CK] Add rocm_ck directory structure with feature flag (#7090) ## Summary Adds initial rocm_ck directory structure, #7119. - Establishes production `rocm_ck/` directory at `composablekernel/rocm_ck/`, peer to `tile_engine/` and `dispatcher/` - Adds `CK_ENABLE_ROCM_CK` option (default OFF) as a CK-internal feature flag; no superbuild or TheRock changes needed - Creates `rocm_ck` INTERFACE library, `ck_tile_headers` target, GTest integration with builder-style convenience targets (`smoke-rocm-ck`, `check-rocm-ck`) - Adds Jenkins `RUN_ROCM_CK_TESTS` parameter for CI, following the `RUN_BUILDER_TESTS` pattern - README explains the constexpr schema model: host-device separation via constexpr data rather than template parameters, enabling multi-arch distribution through kpack archives ## Test plan - [x] `cmake -DCK_ENABLE_ROCM_CK=ON` configures without errors - [x] `ninja check-rocm-ck` passes (4 host-only index type tests) - [x] Default build (`CK_ENABLE_ROCM_CK=OFF`) is unaffected: no rocm_ck targets present - [x] Jenkins `RUN_ROCM_CK_TESTS=true` enables the flag and runs `check-rocm-ck` 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-05-14 18:52:38 +00:00
Meekail Zain	d931e8703d	[rocm-libraries] ROCm/rocm-libraries#6867 (commit 3cb0219) Added custom FMHA codegen receipt for TransformerEngine (#6867) ## Motivation TE uses AITER to build static MHA libraries, which ultimately rely on CK kernels. We use the `600` receipt, which generates more kernels than TE truly needs. This bespoke receipt allows us to minimize the kernel count, compile time, and memory footprint of our MHA library. ## Technical Details Extended the receipt mechanism to include a custom `700` receipt for TE's needs. ## Test Plan Test by building TE using the same receipt profile. ## Test Result Build validated in TE using custom feature branches of AITER/CK to temporarily apply the patch. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 14:34:03 +00:00
Yi DING	83566edb0f	[rocm-libraries] ROCm/rocm-libraries#7331 (commit 5692db0) [CK_TILE] Add async workspace prepare to FMHA BWD launcher (#7331) ## Motivation `aiter::mha_bwd` in group mode currently issues two synchronous `hipMemcpy` D2H copies to read `seqstart_q/k` for launcher construction. These sync copies block the host (~10–30 µs each) and implicitly synchronize the device by draining the stream, breaking CPU/GPU overlap on hot training paths. This PR adds a fully stream-async workspace preparation path on the FMHA BWD launcher so callers can pre-allocate the device workspace from upper-bound shapes and stage seqstart-dependent metadata via D2H/host-pack/H2D entirely on the user's stream. ## Technical Details - `FmhaBwdWorkspaceManager::GetWorkspaceDeviceSizeUpperBound` (`include/ck_tile/ops/fmha/kernel/fmha_bwd_kernel.hpp`): computes the worst-case device dq_acc size from `(max_batch, hdim_q, nhead_q, max_seqlen_q, max_seqlen_k)` without dereferencing any seqstart array. Mirrors `PrepareWorkspaceHost`'s return value with worst-case bounds. - `fmha_bwd_launcher::prepare_workspace_async` (`example/ck_tile/01_fmha/fmha_bwd.hpp`): on the caller's stream, in order: 1. `hipMemsetAsync` of the dq_acc region (when `NeedsZeroDqAcc()`) 2. group mode: `hipMemcpyAsync` D2H of `seqstart_q/k` into a pinned host staging buffer 3. `hipLaunchHostFunc` runs `PrepareWorkspaceHost` on the pinned buffer 4. `hipMemcpyAsync` H2D of the packed metadata into `device_ws_ptr` The pinned staging buffer is held via `std::shared_ptr<void>` returned by a caller-provided `pinned_host_alloc` callback. Lifetime is extended past stream completion by a tail `hipLaunchHostFunc` scheduled in the launcher's destructor. - `ck_tile::pinned_host_releaser` (`include/ck_tile/host/pinned_host_releaser.hpp`): worker-thread utility for callers using bare `hipHostMalloc`. 
Defers `hipHostFree` off the HIP driver callback thread, which holds runtime locks and would deadlock against concurrent main-thread `hipFree`. PyTorch's `CachingHostAllocator` does not need this. - Example runner (`example/ck_tile/01_fmha/fmha_bwd_runner.hpp`): switched to the async path. ## Test Plan - `tile_example_fmha_bwd` (gfx950, dev preset `-Werror -Weverything`): - batch + nondet / batch + det / group + nondet / group + det - group + det 4-batch varlen (`-b=4 -h=8 -s=4096,3072,2048,1024 -d=128`) - FA (`flash-attention`) integration on ROCm 7.1.1 + PyTorch 2.9.1: - `tests/test_flash_attn_ck.py::test_flash_attn_varlen_deterministic` - `tests/test_flash_attn_ck.py::test_flash_attn_bwd_varlen_seqq_zero` ## Test Result - All CK runner cases `valid:y`. - FA pytest: 1952 passed in 44.82s. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 13:34:32 +00:00
Copilot	4d852e80fb	[rocm-libraries] ROCm/rocm-libraries#6983 (commit f4e9a84) Remove batch_prefill from FMHA_FWD_KNOWN_APIS Remove `batch_prefill` from the `FMHA_FWD_KNOWN_APIS` list in `projects/composablekernel/example/ck_tile/01_fmha/CMakeLists.txt`. Change: ```cmake # Before set(FMHA_FWD_KNOWN_APIS "fwd;fwd_splitkv;fwd_appendkv;pagedkv_prefill;batch_prefill") # After set(FMHA_FWD_KNOWN_APIS "fwd;fwd_splitkv;fwd_appendkv;pagedkv_prefill") ``` Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: asleepzzz <4926646+asleepzzz@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-05-14 12:42:14 +00:00
Qianfeng	acf3d65966	[rocm-libraries] ROCm/rocm-libraries#7256 (commit 1fc20eb) Skip numeric drop-out when PComputeWindow is a null_tile_window in Bl… (#7256) The BlockDropout implementation already provides complete logic for generating random numbers and executing dropout for the P tensor after the first attention Gemm, with support for both Warp-Gemm 32x32 and 16x16 and for both wave32 and wave64 architectures. In some situations, however, we only need the block-layer process to generate random numbers, rather than also executing dropout in real time on the vgpr tile. For example, xformers' `test_mem_eff_attention.py::test_dropout_ck` requires the host reference implementation of `attention forward with dropout` to use the same random numbers in order to compare and verify the device-side implementation of `attention forward with dropout`, so a standalone kernel that only generates random numbers is required. This PR enables xformers' random_val generating kernel (in file `ck_tiled_rand_uniform_kernel.h`) to rely entirely on BlockDropout's `Run()` operator to generate random numbers for a `[MPerBlock, NPerBlock]` tile during the tile iteration, without replicating the logic of BlockDropout in the xformers kernel.	2026-05-13 09:42:28 +00:00
Linjun-AMD	5c7b7ec3f1	[rocm-libraries] ROCm/rocm-libraries#7272 (commit d02f3c0) [ck_tile][fmha_bwd] Fix sink_host OOB in group mode reference runner (#7272) ## Summary In `fmha_bwd_runner.hpp`, the `sink_host` `HostTensor` is allocated with first dimension `shape_batch` (= 1 in group mode), but the reference forward loop accesses `sink_host(wb, i_h)` with `wb ∈ [0, batch-1]`. For any `wb >= 1` this is an out-of-bounds heap read, silently corrupting the reference forward math chain (`lse_host`, `o_host`) and turning the bwd-side `d_sink_head_acc` reference into non-deterministic garbage. `HostTensor::operator()` does not bounds check, so the OOB is not caught at runtime. This manifests as intermittent `tile_example_fmha_bwd` failures (25–67% fail rate) when `-sink_grad=1` is combined with `-mode=1` (group mode), with bit-exact but spurious `max_err` values like 4.27 / 14.6. ## Fix One-line: allocate `sink_host` with `batch` (the real per-batch dim) instead of `shape_batch`, mirroring how `sink_host` is accessed by the loop. ```diff - sink_grad ? std::array<ck_tile::index_t, 2>{shape_batch, nhead} + sink_grad ? std::array<ck_tile::index_t, 2>{batch, nhead} ``` Repro: `tile_example_fmha_bwd -b=2 -h=2 -s=516 -s_k=253 -prec=bf16 -d=72 -bias=n -dbias=0 -p_drop=0 -iperm=1 -operm=1 -deterministic=0 -v=3 -mode=1 -kname=1 -sink_grad=1` Verification: - 0/30 fail on the repro config after fix - Baselines (before fix): - sink=1, mask=n: 25% fail rate (p ≈ 1.8e-4) - sink=1, mask=t: 67% fail rate (p ≈ 6e-15) Attribution: Shape bug introduced together with sink_grad in #5504. Unrelated to #6914 (which is a fwd-only fix on a different code path). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-13 08:49:13 +00:00
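Note on 5c7b7ec3f1: the out-of-bounds access can be reproduced with plain index arithmetic. The sketch below is illustrative only and assumes a row-major `[dim0, nhead]` layout for the host tensor, which the commit message does not state; `flat_offset` and `out_of_bounds_accesses` are hypothetical helper names, not CK functions. The `batch` and `nhead` values come from the repro command (`-b=2 -h=2`).

```python
def flat_offset(wb, i_h, nhead):
    """Row-major offset of element (wb, i_h) in a [dim0, nhead] tensor (assumed layout)."""
    return wb * nhead + i_h

def out_of_bounds_accesses(dim0, batch, nhead):
    """List the (wb, i_h) pairs whose offset falls outside a [dim0, nhead] buffer."""
    size = dim0 * nhead
    return [(wb, i_h)
            for wb in range(batch)
            for i_h in range(nhead)
            if flat_offset(wb, i_h, nhead) >= size]

batch, nhead = 2, 2   # from the repro: -b=2 -h=2
shape_batch = 1       # group mode, per the commit message

# Before the fix: buffer sized [shape_batch, nhead], loop runs wb over [0, batch).
before = out_of_bounds_accesses(shape_batch, batch, nhead)

# After the fix: buffer sized [batch, nhead].
after = out_of_bounds_accesses(batch, batch, nhead)

print("OOB before fix:", before)
print("OOB after fix :", after)
```

Under these assumptions every access with `wb >= 1` lands past the end of the `[shape_batch, nhead]` allocation, and none do once the first dimension is `batch`.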
Yaswanth Raparti	6989cf800c	[rocm-libraries] ROCm/rocm-libraries#6327 (commit 1e7a12e) [CK][CK TILE] Dispatcher kernel selection heuristic for grouped conv (#6327) ## Motivation The ML heuristic in the dispatcher does not support the grouped-conv operator yet. In this PR, support for fwd, bwd-data, and bwd-weight grouped-conv kernels has been added. A tile_engine utility has also been added to compile and run any selected kernel configuration through the dispatcher infrastructure. ## Technical Details 1. A tile engine utility is added to benchmark each shape with all the possible kernel+tile_size combinations: https://github.com/ROCm/rocm-libraries/blob/users/yraparti/ck/dispatcher-grouped-conv-heuristics/projects/composablekernel/tile_engine/ops/grouped_conv/grouped_conv_full_benchmark.py 2. New LGBM regressor models for grouped conv are added to the models directory. We have 3 separate models for fwd, bwd-data, and bwd-weight: https://github.com/ROCm/rocm-libraries/tree/users/yraparti/ck/dispatcher-grouped-conv-heuristics/projects/composablekernel/dispatcher/heuristics/models 3. Implemented lazy GPU initialization (dispatcher/python) - Issue: ProcessPoolExecutor fork() + GPU context caused memory access faults - Solution: Mirror FMHA pattern - defer GPU initialization until first run() - Changes: - setup_multiple_grouped_conv_dispatchers() returns List[Path], not loaded libs - GpuGroupedConvRunner.__init__() no longer calls ctypes.CDLL - Added _ensure_initialized() method for lazy GPU loading - GPU context created only on first run() call - Benefit: Parallel compilation now works without GPU conflicts 4. 
Addressed a few miscellaneous issues, such as: - Fixed BF16->FP16 naming bug in the dispatcher wrapper - Added new tile sizes, and comp_v5 pipeline to the arch spec to expand the kernel selection - Added automatic padding support for unsupported shapes in dispatcher runner - Created a single source of truth between tile_engine and dispatcher about the architecture and tile_size details - Built a validation script to compare oracle_best vs ml_heuristic ## Test Plan 1. Validated fwd, bwd-data, and bwd-weight kernels with both known and unseen data sets with up to 300 problems. 2. Ensured that test cases are added in both dispatcher and tile_engine to validate the heuristic. ## Test Result Results on unseen shapes, validated on gfx950 #### Forward Pass Model - Training Data: 48,845 measurements across 1,372 unique problem shapes - Validation Set: 300 unseen problems from model crawler - Validation Performance (vs. oracle): - Mean Efficiency: 93.05% - Median Efficiency: 96.8% - P10 Efficiency: 79.9% #### Backward Data Gradient (bwd_data) Model - Training Data: 18,773 measurements across 891 unique problem shapes - Validation Set: 300 unseen problems from model crawler - Validation Performance (vs. oracle): - Mean Efficiency: 93.8% - Median Efficiency: 96.5% - P10 Efficiency: 82.9% #### Backward Weight Gradient (bwd_weight) Model - Training Data: 34,900 measurements across 1,508 unique problem shapes - Validation Set: 300 unseen problems from model crawler - Validation Performance (vs. oracle): - Mean Efficiency: 96.1% - Median Efficiency: 99.2% - P10 Efficiency: 89.4% ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 20:48:42 +00:00
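Note on 6989cf800c: the lazy-initialization pattern described in item 3 can be sketched in a few lines. This is a hedged illustration rather than the dispatcher's actual code: `LazyRunner`, `fake_loader`, and the library path are hypothetical stand-ins, and a plain Python callable replaces the real `ctypes.CDLL` load so the sketch runs without a GPU. Only the `_ensure_initialized()` and `run()` method names are taken from the commit message.

```python
class LazyRunner:
    """Defers the expensive library load until the first run() call."""

    def __init__(self, lib_path, loader):
        # Only remember what to load; do not touch the GPU or load anything here.
        self.lib_path = lib_path
        self._loader = loader
        self._lib = None

    def _ensure_initialized(self):
        # Load at most once, the first time it is actually needed.
        if self._lib is None:
            self._lib = self._loader(self.lib_path)
        return self._lib

    def run(self, *args):
        lib = self._ensure_initialized()
        return lib(*args)


load_calls = []

def fake_loader(path):
    """Hypothetical stand-in for ctypes.CDLL: records the load, returns a callable."""
    load_calls.append(path)
    return lambda *args: sum(args)

runner = LazyRunner("libexample_kernel.so", fake_loader)  # path is made up
calls_after_construction = len(load_calls)

first = runner.run(1, 2, 3)
second = runner.run(4, 5)
calls_after_runs = len(load_calls)

print(calls_after_construction, first, second, calls_after_runs)
```

Because construction performs no load, runner objects built this way can be created in a parent process before worker processes are forked, with the load happening only in whichever process first calls `run()`.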
Illia Silin	b05040b919	[rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f) [CK] Fix latest batch of staging compiler warnings ## Motivation Suppress the new batch of clang lifetimebound and invalidation warnings with the latest staging compiler. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 14:15:31 +00:00
Yi DING	41064d8684	[rocm-libraries] ROCm/rocm-libraries#7141 (commit 37e40c3) [CK_TILE] Fix typo in fmha_fwd_kernel K-dram unmerge tuple sizes (#7141) ## Summary The qr_async_trload K-dram lambda's `else (XorLengthFold == 1)` branch in `fmha_fwd_kernel.hpp` writes the outer-tile dim of its 3-tuple unmerge/xor/merge as ```cpp number<FmhaPipeline::kQKHeaddim / kDramTileK / FmhaPipeline::kAlignmentK>{} ``` which divides one extra time. For every fp16/bf16 hdim=128 configuration the outer length collapses to 0, e.g. `128 / 128 / 8 == 0`. The 3-tuple product no longer equals `kQKHeaddim`, so unmerge → xor → merge stops round-tripping the head dimension. This bug was masked by the async-load path: it only walks the descriptor via stride and silently absorbs a length=0 outer dim. Any consumer that actually traverses the descriptor (e.g. the TDM path on gfx1250) immediately faults on the resulting `tuple<int, constant<0>>`. The fix drops the extra `/ kAlignmentK` in all three call sites in the same lambda so the outer dim becomes `kQKHeaddim / kDramTileK` and the product is restored to `kQKHeaddim`. Strides are unaffected, so the async path is bit-identical. \| Config (fp16/bf16) \| hdim \| kDramTileK \| kAlignmentK \| a (typo) \| a (fixed) \| product (typo) \| product (fixed) \| \|---\|---\|---\|---\|---\|---\|---\|---\| \| hdim128, kKLoadOnce \| 128 \| 128 \| 8 \| 0 \| 1 \| 0 \| 128 \| \| hdim128, kK0=32 \| 128 \| 32 \| 8 \| 0 \| 4 \| 0 \| 128 \| \| hdim64, kKLoadOnce \| 64 \| 64 \| 8 \| 0 \| 1 \| 0 \| 64 \| \| hdim256, kK0=32 \| 256 \| 32 \| 8 \| 1 \| 8 \| 32 \| 256 \| Bug introduced in 2cc0af6a815a (PR #2888 \"[CK_TILE] FMHA FWD bug fix\"), where the original 2-tuple unmerge was generalized to a 3-tuple and the typo slipped in. 
## Test plan - [x] Built `test_ck_tile_fmha_fwd` (umbrella, 5 gtest binaries) on gfx950 native at develop b3bdc63a509 with `dev-gfx950` preset (clang 22, ROCm 7.2.2). Compiles cleanly with `-Werror -Weverything`. - [x] Ran `ctest -R test_ck_tile_fmha_fwd` on gfx950 native, baseline vs patched: identical pass/fail (3 pass / 2 fail), identical failing case set (114 gtest fails + 2 GPU memory access faults, all in pre-existing fp16/bf16 group-mode `Alibi`/`Dropout` cases that reproduce on develop without this patch). Total wall time 403s → 393s. Per-case latency drift ±8% (noise). - [x] CI to verify on other gfx9 / gfx11 architectures.	2026-05-08 08:51:33 +00:00
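The integer-division collapse described in commit 41064d8684 above can be checked with a few lines of arithmetic. This is a sketch only: Python's `//` stands in for C++ integer division, the helper names are hypothetical, and the product is taken as `a * kDramTileK`, which is inferred from the table in the commit message (the remaining two tuple lengths are assumed to multiply to `kDramTileK`).

```python
def outer_dim_typo(hdim, dram_tile_k, alignment_k):
    """Outer length with the extra '/ kAlignmentK' (truncating integer division)."""
    return hdim // dram_tile_k // alignment_k


def outer_dim_fixed(hdim, dram_tile_k):
    """Outer length after the fix: kQKHeaddim / kDramTileK."""
    return hdim // dram_tile_k


def tuple_product(outer, dram_tile_k):
    """Product of the 3-tuple, assuming the two inner lengths multiply to kDramTileK."""
    return outer * dram_tile_k


# (hdim, kDramTileK, kAlignmentK) rows taken from the table in the commit message.
configs = [(128, 128, 8), (128, 32, 8), (64, 64, 8), (256, 32, 8)]

rows = []
for hdim, tile_k, align_k in configs:
    a_typo = outer_dim_typo(hdim, tile_k, align_k)
    a_fixed = outer_dim_fixed(hdim, tile_k)
    rows.append((a_typo, a_fixed, tuple_product(a_typo, tile_k), tuple_product(a_fixed, tile_k)))
```

Only the fixed variant yields a product equal to the head dimension in every row, which is the round-trip property the unmerge/xor/merge chain relies on.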
Illia Silin	cc29502c28	[rocm-libraries] ROCm/rocm-libraries#6933 (commit ac8b7d9) [CK] Filter out unsupported targets. ## Motivation Filter out any unsupported targets, e.g., gfx900, gfx906, gfx90c, from the GPU_TARGETS or GPU_ARCHS lists. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 03:24:33 +00:00
Illia Silin	837fb379d9	[rocm-libraries] ROCm/rocm-libraries#7138 (commit 70e6660) [CK] disable tile_engine by default, limit gfx1030 CI builds to develop only. (#7138) ## Motivation An attempt to reduce the build time and keep CI moving faster. Disable tile_engine by default since even the cmake step may take up to 30 minutes. Since we're down to a single gfx1030 CI node, use it only for develop builds. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 01:41:00 +00:00
Chao	1d1be9e3de	[rocm-libraries] ROCm/rocm-libraries#6529 (commit 93a6097) [CK_TILE] Enable V3 persistent kernel dispatch for FMHA forward on gfx950 (#6529) ## Motivation Enable the existing V3 persistent kernel path for CK-Tile FMHA forward on gfx950 (MI350X/MI355X). The V3 kernel and codegen infrastructure already exist but are disabled via hardcoded `F_is_v3_enabled=False`. This change replaces the compile-time gate with a runtime environment variable `CK_FMHA_ENABLE_V3=1` (disabled by default, opt-in). When enabled: - Prefill workloads (seqlen_q > 1) dispatch to V3 persistent pipeline - Decode workloads (seqlen_q == 1) always use V2 (memory-bound, better suited) The V3 persistent kernel uses grid-stride scheduling, XCD-interleave tile assignment for L2 locality, LPT reversal for causal masks, and gfx950 async buffer loads. ## Technical Details Single file: `example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py` - Add `#include <cstdlib>` and `<string>` for `std::getenv` - Replace `{F_is_v3_enabled}` template parameter with runtime env var check - Add `seqlen_q > 1` guard (decode always uses V2) - Remove `.format()` call in `write_fwd_api()` ## Dependencies Depends on https://github.com/ROCm/rocm-libraries/pull/6501 — builds on XCD-interleave and LPT scheduling infrastructure. 
## Test Plan - GPU validation on MI300X (gfx942, ROCm 6.4.1): - Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3` - GPU validation on MI350X (gfx950, ROCm 7.0): - Command (V2): `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3` - Command (V3): `CK_FMHA_ENABLE_V3=1 ./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3` - Command (decode, always V2): `./build/bin/tile_example_fmha_fwd -b=64 -h=32 -h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` ## Test Result Benchmark results (MI350X, gfx950, ROCm 7.0): \| Config \| V2 (TFlops) \| V3 (TFlops) \| Speedup \| \|--------\|-------------\|-------------\|---------\| \| Non-causal b=2 h=8 hk=2 s=4096 d=128 bf16 \| 696.3 \| 884.2 \| +27.0% \| \| Causal b=2 h=8 hk=2 s=4096 d=128 bf16 \| 371.3 \| 494.9 \| +33.3% \| \| GQA b=2 h=32 hk=8 s=2048 d=128 bf16 \| 671.3 \| 831.7 \| +23.9% \| \| LLaMA-70B b=1 h=64 hk=8 s=4096 d=128 bf16 \| 761.5 \| 927.3 \| +21.8% \| \| Causal GQA b=2 h=32 hk=8 s=2048 d=128 bf16 \| 345.4 \| 631.9 \| +82.9% \| \| Long-seq b=1 h=16 s=16384 d=128 bf16 \| 797.8 \| 969.9 \| +21.6% \| \| Decode b=64 h=32 hk=8 s=1 s_k=4096 bf16 \| 1828 GB/s \| — (V2 path) \| unaffected \| Benchmark results (MI300X, gfx942, ROCm 6.4.1): V3 has 0% effect on MI300X — V3 relies on gfx950 async buffer loads and falls back to the V2 code path on gfx942. No regression on any config. 
\| Config \| TFlops / GB/s \| Time (ms) \| Delta vs baseline \| \|--------\|-------------\|-----------\|-------------------\| \| MHA bf16 b=2 h=8 s=4096 d=128 \| 342.98 TFlops \| 0.401 \| +0.1% \| \| MHA fp16 b=2 h=8 s=4096 d=128 \| 411.18 TFlops \| 0.334 \| +4.9% \| \| Causal MHA bf16 b=2 h=8 s=4096 d=128 \| 232.61 TFlops \| 0.296 \| +2.4% \| \| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 \| 320.07 TFlops \| 0.429 \| -1.4% \| \| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 \| 353.91 TFlops \| 0.777 \| +1.7% \| \| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 \| 381.53 TFlops \| 1.441 \| +1.2% \| \| Long-seq bf16 b=1 h=16 s=16384 d=128 \| 388.61 TFlops \| 5.659 \| +1.4% \| \| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 \| 693.40 GB/s \| 1.550 \| +0.3% \| All validation tests pass (`valid:y`) on both MI300X and MI350X. Additional validation: - `CK_FMHA_ENABLE_V3=0` correctly falls back to V2 (default behavior unchanged) - `CK_FMHA_ENABLE_V3=1` dispatches to V3 for prefill, V2 for decode - Validation passes across fp16/bf16, batch/group mode, causal/non-causal - No regression on decode path	2026-05-07 16:23:19 +00:00
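The runtime gate described in commit 1d1be9e3de above reduces to two conditions: the opt-in environment variable and the prefill guard. The sketch below mirrors that decision in Python. The real check lives in generated C++ and uses `std::getenv`; the exact string comparison (`== "1"`) and the function name `use_v3` are assumptions, and the architecture check for gfx950 is left out.

```python
import os


def use_v3(seqlen_q, env=None):
    """Return True when the V3 persistent pipeline would be selected.

    Assumption: only the literal value "1" enables V3; any other value,
    or an unset variable, keeps the default V2 path.
    """
    env = os.environ if env is None else env
    enabled = env.get("CK_FMHA_ENABLE_V3", "0") == "1"
    is_prefill = seqlen_q > 1  # decode (seqlen_q == 1) always stays on V2
    return enabled and is_prefill


# Illustrative calls with an explicit environment mapping.
prefill_opt_in = use_v3(4096, {"CK_FMHA_ENABLE_V3": "1"})
decode_opt_in = use_v3(1, {"CK_FMHA_ENABLE_V3": "1"})
prefill_default = use_v3(4096, {})
```

Passing the environment as a parameter keeps the sketch testable without mutating the process environment.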
Linjun-AMD	33b62ed087	[rocm-libraries] ROCm/rocm-libraries#6914 (commit b791478) [CK_TILE][FMHA] Fix sink un-mask under right-window and emit fp8bf16 batch_prefill sink kernels (#6914) ## Summary Two related fixes to `ck_tile` FMHA so that StreamLLM-sink + sliding-window batch-prefill works correctly for fp8 KV / bf16 compute. Review the commits in this order: 1. `fmha: emit sink kernels for fp8bf16 batch_prefill` Extends `example/ck_tile/01_fmha/codegen/ops/fmha_batch_prefill.py` so the fp8(KV) / bf16(QO) batch-prefill codegen also emits the `mask=mask_enum::generic_with_sink` variant. Without this the runtime could not dispatch to a sink-aware kernel for the fp8bf16 path. 2. `fmha: respect right-window in IsOutOfSinkBound` The sink un-mask in `GenericAttentionMask::IsOutOfSinkBound` (local-mask branch) used `(i_y + x) > 1` as the gate, which conditioned on the row index instead of the column index. As a result, queries `1..sink-1` could attend to future sink positions (violating causal / right-window), while query `0` fell back to the plain causal mask. The fix replaces the guard with `i_x < i_y + x` so every query only sees sink columns up to its own right-window boundary. 3. `fmha: clarify IsOutOfSinkBound predicate comment` Doc-only follow-up that rewrites the comment above the predicate as a clause-by-clause explanation (`i_x < sink`, `i_x < i_y + x`, `y < y_total`, `i_y < x_total`). ## Test plan - [x] Repro on aiter `op_tests/test_batch_prefill.py` (fp8 + bf16_dequant modes with `sink=4`, `win_left=1023`, `softcap=0.0`, `sal=True`) now passes for all parametrized shapes. - [x] Existing fp16/bf16 batch-prefill paths (no sink) unchanged — codegen diff only adds the `generic_with_sink` variant for fp8bf16; existing kernel object lists unaffected. 
## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-07 02:40:45 +00:00
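The difference between the old and new sink guards in commit 33b62ed087 above is easiest to see by listing which sink columns each query row is allowed to un-mask. The sketch below is a simplification: it models only the `i_x < sink` clause plus the guard that changed, ignores the `y < y_total` and `i_y < x_total` clauses, and assumes a right-window of `x = 1` (plain causal). The function names are hypothetical.

```python
def sink_unmasked_old(i_y, i_x, sink, x):
    """Pre-fix gate: conditions on the row index only."""
    return i_x < sink and (i_y + x) > 1


def sink_unmasked_new(i_y, i_x, sink, x):
    """Post-fix gate: the column must lie inside the row's right-window boundary."""
    return i_x < sink and i_x < i_y + x


def visible_sink_columns(gate, i_y, sink, x):
    """Sink columns that the given gate un-masks for query row i_y."""
    return [i_x for i_x in range(sink) if gate(i_y, i_x, sink, x)]


SINK, X = 4, 1  # sink=4 as in the test plan; x=1 is an assumed causal right-window

old_rows = [visible_sink_columns(sink_unmasked_old, i_y, SINK, X) for i_y in range(4)]
new_rows = [visible_sink_columns(sink_unmasked_new, i_y, SINK, X) for i_y in range(4)]
```

Under these assumptions the old gate gives row 0 no sink columns and rows 1 to 3 all four of them, including future positions, while the new gate gives row `i_y` exactly the columns `0..i_y`.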
Yi DING	207a95d5e4	[rocm-libraries] ROCm/rocm-libraries#6152 (commit 36b016a) [CK_TILE] Use Unified Workspace for FMHA BWD ## Motivation `dq_acc` is the intermediate accumulation buffer used in the FMHA backward pass for deterministic mode. The current implementation allocates it as a single rectangular tensor: ``` shape = [shape_batch, nhead, nsplits, shape_seqlen_q, hdim_q] ``` where `nsplits = launcher.dq_acc_splits` (a single scalar), computed from `max_seqlen_k` and shared across all batches. ### Problems 1. Memory waste: In group mode, each batch may have a different `seqlen_k`, but `nsplits` is computed from `max_seqlen_k`, causing batches with shorter `seqlen_k` to over-allocate in the split dimension. 2. Interface coupling: `fmha_bwd_args` exposes internal layout details such as `stride_dq_acc`, `nhead_stride_dq_acc`, `batch_stride_dq_acc`, and `split_stride_dq_acc`. The caller is responsible for computing these strides, but this logic belongs inside the kernel. ### Goals 1. Switch `dq_acc` buffer to a compact layout: batches are concatenated contiguously, with each batch occupying `nhead * nsplits_i * seqq_i * hdim_q` elements (nhead outermost). 2. Remove all `_stride_dq_acc` fields from `fmha_bwd_args`, replacing them with a single `workspace_ptr`; the kernel splits this internally using a fixed layout. 4. `fmha_bwd_launcher` provides a workspace management interface: the caller only needs to allocate GPU memory and call `prepare_workspace()`, with no layout computation required. 5. Isolate kernel internals from the caller API: the `dq_acc` layout (nsplits, strides, buffer size) is determined entirely inside the launcher/kernel. Future changes to block shape, pipeline type, or persistent kernel strategy require no modifications to the caller's `fmha_bwd_args` or workspace allocation logic. 
## Technical Details ### Interface Design #### New fields in `fmha_bwd_traits` ```cpp struct fmha_bwd_traits { int seqlen_q; int seqlen_k; int batch; int max_seqlen_q; int max_seqlen_k; int hdim_q; int hdim_v; int nhead_q; int nhead_k; std::string data_type; bool is_group_mode; mask_enum mask_type; bias_enum bias_type; bool has_dbias; bool has_dropout; bool is_store_randval; bool is_deterministic; // New: cumulative physical seqlen pointers for group mode (pass nullptr for batch mode). // seqstart_qs[i+1] - seqstart_qs[i] = physical seqlen_q of batch i (including padding); length = batch+1 // seqstart_ks[i+1] - seqstart_ks[i] = physical seqlen_k of batch i (including padding); length = batch+1 const int* seqstart_qs = nullptr; const int* seqstart_ks = nullptr; }; ``` #### `fmha_bwd_launcher` actual structure ```cpp struct fmha_bwd_launcher { std::function<float(fmha_bwd_args, const ck_tile::stream_config&)> run{}; // Total workspace size in bytes (host_ws_size + device_ws_size), computed by init(). // Zero for kUseQrQtrDorPipeline (writes dq directly, no acc buffer needed). size_t workspace_size = 0; fmha_bwd_launcher(const fmha_bwd_traits&); // Copies auxiliary data (nsplits[], offsets[]) via hipMemcpy to the head of the GPU workspace, // and zeros the dq_acc buffer portion (tail of workspace) if required. // The memory pointed to by device_ws must be >= workspace_size bytes. std::function<void(void* device_ws)> prepare_workspace{}; template <typename... Args> float operator()(Args&&... 
args) const { return run(std::forward<Args>(args)...); } private: size_t host_ws_size = 0; // CPU workspace size (nsplits[] + offsets[] arrays) size_t device_ws_size = 0; // GPU-only data size (dq_acc buffer) std::unique_ptr<char[]> ws_host; // host-side workspace buffer public: template <typename T0, typename T1, typename T2, typename Arch> void init(const fmha_bwd_traits& traits); }; ``` The `init<>()` template method (invoked by codegen dispatch branches as `this->init<...>(t)`) is responsible for: 1. Setting the `run` lambda 2. Calling `FmhaBwdDQDKDVKernel::GetWorkspaceHostSize(batch)` to obtain `host_ws_size` 3. Allocating `ws_host` (host memory) 4. Calling `FmhaBwdDQDKDVKernel::PrepareWorkspaceHost(ws_host.get(), ...)` to fill nsplits/offsets; return value is `device_ws_size` 5. `workspace_size = host_ws_size + device_ws_size` 6. Setting the `prepare_workspace` lambda (captures `this`, calls `PrepareWorkspaceDevice`) When no kernel matches the given traits, both `run` and `prepare_workspace` are initialized to default lambdas that print a warning to `std::cerr` and return gracefully (no exception). 
#### Workspace overall layout The workspace is managed by `FmhaBwdWorkspaceManager` and consists of two segments: ``` Offset 0 (CPU-prepared segment, host_ws_size bytes; also hipMemcpy'd to the head of GPU workspace): index_t nsplits[batch or 1] - per-batch nsplits array group mode: batch elements batch mode / non-deterministic: 1 element [group mode only] long_index_t dq_acc_offsets[batch+1] - per-batch element offset (inclusive prefix sum) offsets[0]=0, offsets[i+1] = offsets[i] + nhead * nsplits_i * seqq_i * hdim_q Offset host_ws_size (device data segment, device_ws_size bytes): AccDataType dq_acc[total_elements] - compact dq_acc buffer (zeroed if required) total_elements = sum_i(nhead * nsplits_i * seqq_i * hdim_q) layout within each batch: [nhead, nsplits_i, seqq_i, hdim_q] note: seqq_i uses the physical length (including padding) ``` Alignment constant (`ALIGNMENT = 16`): ``` nsplits_size = align_up(sizeof(index_t) * N, 16) // N = batch (group) or 1 (batch/non-det) offsets_size = align_up(sizeof(long_index_t) * (batch+1), 16) // group mode only host_ws_size = nsplits_size + offsets_size dq_acc_offset = host_ws_size // GetDqAccDataOffset(batch) ``` Key benefits: - The kernel reads nsplits/offsets directly from the workspace head, with no device-side recomputation. - `FmhaBwdConvertQGradKernel` is completely decoupled from the pipeline block shape (`kN0`): nsplits is read from `nsplits_ptr`, `kN0` is no longer a template parameter, and multiple dq_dk_dv tiles with different `F_bn0` values now share a single convert_dq kernel instance (under receipt 1/2, deterministic convert_dq kernel count drops from ~300 to 60). - nsplits/offsets are computed on the host and transferred in one `hipMemcpy`; the dq_acc buffer follows immediately, at the offset given by `GetDqAccDataOffset`. 
#### Workspace size by scenario \| Scenario \| `workspace_size` \| Notes \| \|----------\|-----------------\|-------\| \| kUseQrQtrDorPipeline (any mode) \| `0` \| Writes dq directly; no acc buffer; `PrepareWorkspaceHost` returns 0 \| \| Non-deterministic + batch mode \| `> 0` \| nsplits[1]=1; dq_acc used for atomic add; `workspace_size = host_ws_size + batch * nhead * seqlen_q * hdim_q * ebytes` \| \| Non-deterministic + group mode \| `> 0` \| nsplits[1]=1; dq_acc contiguous layout; `workspace_size = host_ws_size + nhead * seqstart_qs[batch] * hdim_q * ebytes` \| \| Deterministic + group mode \| `> 0` \| nsplits[batch], offsets[batch+1], compact dq_acc; nsplits_i computed independently per batch \| \| Deterministic + batch mode persistent \| `> 0` \| nsplits[1] (uniform across batches); dq_acc `batch * nhead * nsplits * seqlen_q * hdim_q` \| NeedsZeroDqAcc (determines whether `PrepareWorkspaceDevice` calls `hipMemset`): - Persistent kernel (deterministic batch mode) or non-deterministic: must zero (atomic add requires zero initialization) - Deterministic group mode + no mask: no zeroing needed (every tile writes its full region) - Deterministic + with mask: must zero (some blocks are skipped, leaving uninitialized tiles that would contribute to the reduction) #### Caller usage ```cpp // 1. Create launcher (traits include seqstart_qs/ks pointers; workspace_size is computed during construction) fmha_bwd_launcher launcher(fmha_traits); // 2. Read launcher.workspace_size directly const auto ws_size = launcher.workspace_size; // 3. Allocate a single GPU workspace ck_tile::DeviceMem ws_buf(ws_size); // 4. Copy nsplits/offsets to GPU head and zero dq_acc if required launcher.prepare_workspace(ws_buf.GetDeviceBuffer()); // 5. Build args with a single workspace pointer; the kernel splits it internally fmha_bwd_args args{ ..., ws_size > 0 ? ws_buf.GetDeviceBuffer() : nullptr, // workspace_ptr }; launcher(args, stream_config); ```	2026-05-07 02:23:28 +00:00
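The workspace sizing rules in commit 207a95d5e4 above can be worked through numerically. The sketch below follows the alignment formulas and the prefix-sum definition of the `dq_acc` offsets for the deterministic group-mode case. It assumes `index_t` is 4 bytes, `long_index_t` is 8 bytes, and the accumulator element is 4 bytes; those sizes, the `per_batch_splits` flag, and all function names are assumptions for illustration, not the library's API.

```python
ALIGNMENT = 16
SIZEOF_INDEX_T = 4       # assumption: index_t is a 32-bit integer
SIZEOF_LONG_INDEX_T = 8  # assumption: long_index_t is a 64-bit integer
SIZEOF_ACC = 4           # assumption: AccDataType is a 32-bit float


def align_up(n, alignment=ALIGNMENT):
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def host_ws_size(batch, per_batch_splits):
    """Bytes for nsplits[] plus (deterministic group mode only) dq_acc_offsets[]."""
    n = batch if per_batch_splits else 1
    nsplits_size = align_up(SIZEOF_INDEX_T * n)
    offsets_size = align_up(SIZEOF_LONG_INDEX_T * (batch + 1)) if per_batch_splits else 0
    return nsplits_size + offsets_size


def dq_acc_offsets(nhead, hdim_q, nsplits, seqlens_q):
    """Element offsets: offsets[0] = 0, offsets[i+1] = offsets[i] + nhead*nsplits_i*seqq_i*hdim_q."""
    offsets = [0]
    for nsplits_i, seqq_i in zip(nsplits, seqlens_q):
        offsets.append(offsets[-1] + nhead * nsplits_i * seqq_i * hdim_q)
    return offsets


# Made-up deterministic group-mode problem with three batches.
offsets = dq_acc_offsets(nhead=2, hdim_q=4, nsplits=[1, 2, 1], seqlens_q=[8, 16, 8])
host_bytes = host_ws_size(batch=3, per_batch_splits=True)
workspace_size = host_bytes + offsets[-1] * SIZEOF_ACC
```

The last entry of `offsets` is the total element count of the compact buffer, so the device segment starts at `host_bytes` and spans `offsets[-1] * SIZEOF_ACC` bytes.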
Illia Silin	250c29f914	[rocm-libraries] ROCm/rocm-libraries#7046 (commit aaf7665) [CK] fix CI git token. ## Motivation Fix the CI breakage due to git PAT deprecation. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-06 02:32:37 +00:00
Gino Lu	b00e5449c8	sparse_attn: split KStats kernel, add README + perf charts - Split SpargeKStatsKernel/Pipeline out of BlockMap (Kernel A produces per-block K stats workspace consumed by Kernel B), removing redundant K-stat recomputation across Q-blocks. - Add example/ck_tile/50_sparse_attn/README.md (status vs upstream pinned to ae5b629, unported items, usage, references). - Add example/ck_tile/50_sparse_attn/docs/{speedup_vs_sparsity,kernel_breakdown}.png + reusable plot_sparge_perf.py (b=2 h=32 s=16384 d=128 fp16 perf snapshot). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>	2026-05-05 03:13:24 -04:00
Jeff Huang	10cb6916c3	[rocm-libraries] ROCm/rocm-libraries#6932 (commit ce3e67b) [CK] Fix OOB page table read in batch_prefill V prefetch (AICK-1171) (#6932) ## Summary Fix a GPU memory access fault in `mha_batch_prefill` triggered when the per-batch page table is tightly sized (no trailing slack). Affected configurations: - All FMHA batch prefill V2 kernels (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`) - Triggered by paged KV layouts where `kv_page_indices.numel() == ceil(seqlen_k / page_size)` exactly - Manifests as: `Memory access fault by GPU node-X (Agent handle: 0x...)` followed by `Aborted (core dumped)` - Silent corruption (no fault, wrong output) when the OOB read happens to land in zero-initialized memory ### Root cause `load_physical_pages` performs lookahead reads on the page table to prefetch K/V tiles for the next iteration. When the page table for a batch has exactly `N` entries, the V-tile prefetch indexes `page_idx[N]` (one past the last valid entry), reading either uninitialized memory or the next batch's slot. On gfx942 with a tightly-sized page table, the read crosses into an unmapped page and triggers an HSA page fault. The bug was masked in earlier testing because most test harnesses pad `kv_page_indices` with trailing zeros — OOB reads then return `page_id = 0`, a valid in-cache page, producing silent numerical drift instead of a fault. ### Fix design Thread `max_page_table_idx = (seqlen_k - 1) / page_size` from the kernel layer down to `load_physical_pages`, and clamp every page-table read with `ck_tile::min()`. 
Applied to all four code paths in the V prefetch: \| Branch \| What it does \| Clamp applied \| \|--------\|-------------\|---------------\| \| `kIsKcache` \| K prefetch loop \| `min(global_token_idx >> kLog2PageSize, max_page_table_idx)` \| \| V LINEAR (`page_size == 1`) \| One token = one page \| `min(global_token_idx, max_page_table_idx)` \| \| V crosses pages (`kVTileCrossesPages`) \| Per-thread page lookup \| `min(global_token_idx >> kLog2PageSize, max_page_table_idx)` \| \| V single page (lane0 broadcast) \| `readfirstlane`-uniform lookup \| `min(... >> kLog2PageSize, max_page_table_idx)` \| ### Key design decisions Mandatory parameter, not optional with a sentinel default. An optional `max_page_table_idx = INT32_MAX` default would let the bug silently come back at any new callsite that forgets to pass it. Making it mandatory forces every caller to opt in explicitly and surfaces missed callsites at compile time. `seqlen_k == 0` clamps to 0 instead of underflowing `(0 - 1) / page_size` to `-1`. The empty-batch case is rare but well-defined: clamp every read to slot 0. Single computation in the kernel layer. `FmhaBatchPrefillWithPagedKVCacheKernel` computes `max_page_table_idx` once per batch and forwards it through every QScale branch (PERTENSOR / KV_BLOCKSCALE / default). All three `operator()` overloads of the pipeline (rich, default forwarder, KV_BLOCKSCALE forwarder) take and forward the parameter. 
### Files changed \| File \| Change \| \|------\|--------\| \| `include/ck_tile/ops/fmha/kernel/fmha_batch_prefill_kernel.hpp` \| Compute `max_page_table_idx` per batch, forward to all 3 QScale branches \| \| `include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp` \| Add `max_page_table_idx` to `load_physical_pages` and 3 `operator()` overloads; clamp page-id reads in 4 code paths \| ## Test plan - [x] AICK-1171 reproducer verified on MI-308X (gfx942) - [x] New pytest case `test_batch_prefill_aick1171_oob_page_table_read` in aiter, parametrized over `total_blocks ∈ {160, 164, 168, 176, 208, 256}` (matches the `crash1_r8_*` bisect family) - [x] Full FMHA batch prefill suite on gfx942 + gfx950 ## Linked issue AICK-1171.	2026-05-05 06:29:51 +00:00
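The clamp described in commit 10cb6916c3 above can be sketched on the host side to show why a lookahead read no longer indexes past the page table. The sketch assumes a power-of-two page size (so the page slot is a right shift, as in the `kLog2PageSize` branches) and models the `seqlen_k == 0` case as a clamp to slot 0. The function names are hypothetical, and a plain Python list stands in for the device page table.

```python
def max_page_table_idx(seqlen_k, page_size):
    """Index of the last valid page-table slot; 0 for an empty batch (no underflow to -1)."""
    if seqlen_k == 0:
        return 0
    return (seqlen_k - 1) // page_size


def page_slot(global_token_idx, log2_page_size, max_idx):
    """Page-table slot for a token, clamped so lookahead reads stay in bounds."""
    return min(global_token_idx >> log2_page_size, max_idx)


def lookup_page(page_table, global_token_idx, log2_page_size, seqlen_k):
    """Clamped page-table read, standing in for one read inside the prefetch."""
    max_idx = max_page_table_idx(seqlen_k, 1 << log2_page_size)
    return page_table[page_slot(global_token_idx, log2_page_size, max_idx)]


# Tightly sized table: seqlen_k = 64 tokens, page_size = 16, exactly 4 entries.
page_table = [7, 3, 9, 5]  # made-up physical page ids
in_range = lookup_page(page_table, 40, 4, 64)   # token 40 lives in slot 2
lookahead = lookup_page(page_table, 64, 4, 64)  # one tile past the end: slot 4 clamps to 3
```

Without the `min`, the lookahead read in this example would index `page_table[4]`, which is exactly the one-past-the-end access the commit describes.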
Yaswanth Raparti	af02240be8	[rocm-libraries] ROCm/rocm-libraries#6912 (commit c705da2) [CK] Reduce per-file logging in cmake_dependency_analyzer (#6912) ## Motivation Current progress_callback function generates large volume of prints which creates noise in seeing actual CI failure logs. Only emit a progress line at the completion of each stage to avoid massive logs from the per-source-file extracting_dependencies callback. ## Technical Details Update the `progress` function to print only at the completion of each stage. https://github.com/ROCm/rocm-libraries/pull/6912/changes#diff-15971b83c7dfefb48fd788507a923017d93bbd9487ed6aeb414ad2c5e00be934R720 ## Test Plan to be tested in CI ## Test Result to be tested in CI ## Submission Checklist - [x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-01 08:13:45 +00:00
John Shumway	f3d8a1269a	[rocm-libraries] ROCm/rocm-libraries#6972 (commit 8761b90) [CK] Dockerfile: auto-discover latest TheRock nightly tarball (#6972) ## Motivation Our docker containers with `--build-arg compiler_version=therock` should have the latest nightly build of TheRock in `/opt/rocm`. When I looked for `rocm_kpack` and other `kpack` artifacts, they were missing, and I realized we had pinned the version by date. Instead, we should look for the most recent linux-multiarch tarball. ## Summary - Auto-discover the latest TheRock nightly tarball at Docker build time instead of pinning a stale URL (previously hardcoded to a Feb 2026 nightly that predates kpack) - Logic is to `wget` the directory, and identify the latest tarball (alphabetically sorted by YYYYMMDD in filename). - Support manual override via `--build-arg TARBALL_URL=...` for pinning, and `--build-arg TARBALL_PATTERN=...` for selecting a specific arch variant - Fix sccache download URL: `/releases/latest/download/` was redirecting to v0.15.0 but the filename referenced v0.14.0, causing a 404 ## Test plan - [x] Verified tarball discovery logic resolves to `therock-dist-linux-multiarch-7.13.0a20260430.tar.gz` - [x] Built Docker image locally with `--build-arg compiler_version=therock` - [x] Confirmed sccache installs successfully with the fixed URL - [ ] Verify CI pipeline builds with the updated Dockerfile 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 06:42:14 +00:00
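The "latest tarball by YYYYMMDD" selection described in commit f3d8a1269a above can be sketched as a pure function over a directory listing. The Dockerfile itself does this with `wget` and shell tools; the Python version below only illustrates the selection rule. The prefix default is an assumed stand-in for `TARBALL_PATTERN`, the function name is hypothetical, and every filename except the one quoted in the test plan is made up.

```python
import re

DATE_RE = re.compile(r"(\d{8})\.tar\.gz$")  # trailing YYYYMMDD before the extension


def latest_tarball(filenames, prefix="therock-dist-linux-multiarch-"):
    """Pick the tarball with the greatest YYYYMMDD stamp among names matching the prefix."""
    candidates = []
    for name in filenames:
        match = DATE_RE.search(name)
        if name.startswith(prefix) and match:
            candidates.append((match.group(1), name))
    if not candidates:
        raise ValueError("no tarball matches the requested prefix")
    # Fixed-width YYYYMMDD strings sort lexicographically in date order.
    return max(candidates)[1]


listing = [
    "therock-dist-linux-multiarch-7.12.0a20260215.tar.gz",  # made-up older nightly
    "therock-dist-linux-multiarch-7.13.0a20260430.tar.gz",  # name from the test plan
    "therock-dist-linux-multiarch-7.13.0a20260429.tar.gz",  # made-up
    "therock-dist-linux-other-7.13.0a20260501.tar.gz",      # made-up, different variant
    "index.html",
]
chosen = latest_tarball(listing)
```

Sorting on the extracted date rather than the whole filename avoids surprises when the version prefix changes between nightlies.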
ltqin	de0a61e5c2	[rocm-libraries] ROCm/rocm-libraries#6574 (commit b3db057) [CK_TILE] Add SageAttention v2 forward kernel with multi-granularity quantization (#6574) ## Summary Add a CK_TILE forward kernel implementing [SageAttention v2](https://arxiv.org/abs/2411.10958) — an attention algorithm that applies multi-granularity quantization to Q/K/V before computing attention, trading minimal accuracy loss for higher throughput on low-precision hardware. ### Quantization design \| Tensor \| Supported data types \| Scale granularity options \| \|--------\|---------------------\|--------------------------\| \| Q \| fp8 / int8 / int4 \| per-tensor, per-block (128 tokens), per-warp (32 tokens), per-thread (4 tokens) \| \| K \| fp8 / int8 / int4 \| per-tensor, per-block (128 tokens), per-warp (64 tokens), per-thread (16 tokens) \| \| V \| fp8 \| per-channel (always) \| \| O \| bf16 \| — \| Three precision combinations are supported: `fp8/bf16` (QKV fp8, O bf16), `i8/fp8/bf16` (QK int8, V fp8, O bf16), and `i4/fp8/bf16` (QK int4, V fp8, O bf16). ### Architecture support - gfx9 (CDNA2/3, e.g. 
gfx90a, gfx942) — full tile set - gfx950 (CDNA4) — restricted tile set (N-per-block capped at 64 for fp8-family dtypes) ### Implementation - Two pipeline variants: `QRKSVS` (synchronous) and `QRKSVS_ASYNC` (async copy) - Masking support: no mask, causal (top-left / bottom-right), and generic windowed - Batch and group (variable-length) modes - Head dimension: d=128, d_v=128 - Python codegen under `example/ck_tile/49_sageattention/codegen/` generates kernel instances per target/dtype/tile combination - Smoke tests included via `tile_example_sageattn_fwd` ### Test commands \`\`\`bash # fp8 QKV ./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128 -kname=1 -prec=fp8bf16 -qscale=3 -init=3 # int8 QK, fp8 V ./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128 -kname=1 -prec=i8fp8bf16 -qscale=3 -init=3 \`\`\` \`-qscale\` values: 1=per-tensor, 2=per-block, 3=per-warp, 4=per-thread	2026-04-30 18:33:36 +00:00
Illia Silin	e8d64ad5c6	[rocm-libraries] ROCm/rocm-libraries#6741 (commit 0d4180f) [CK] restore fmha performance reporting and disable c++17 in CI. (#6741) ## Motivation This change restores monitoring of FMHA benchmarks performance in daily builds and removes the std=c++17 flag from CI builds on gfx90a. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-25 02:23:10 +00:00
Qianfeng	865ab2b8ed	[rocm-libraries] ROCm/rocm-libraries#6209 (commit 89c9f3e) Improve the performance of qr_ks_vs_whole_k_prefetch pipeline (#6209) ## About qr_ks_vs_whole_k_prefetch pipeline This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to improve performance on both MI300 and MI350 GPUs through better MFMA instruction usage, transposed V-loading support, and N0-loop implementation. The pipeline targets scenarios where the number of workgroups is low, enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs 128) while prefetching entire K tiles. ## Changes: - Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload) to avoid using shuffle instructions on MI350 - Implements N0-loop based Gemm0 to reduce tile window movement overhead and eliminate `clear_tile` calls - Adds full support for hdim96/hdim160 without padding requirements - Updates MFMA instruction selection to ensure optimal choices for MI350 ## Performance results 1. For attention shapes that lead to kM0=64, `qr_ks_vs_async_whole_k_prefetch_trload` shows much better performance than `qr_ks_vs_async_trload` on the same case (execution time `41.02ms` by whole_k_prefetch_trload & `58.50ms` by async_load), and `qr_ks_vs_async_whole_k_prefetch_trload` also shows clearly better performance than the recently tuned `qr_ks_vs_async` on the same case (execution time `41.02ms` by whole_k_prefetch_trload & `47.60ms` by qr_ks_vs_async) 2. Also on MI300, for attention shapes that lead to kM0=64, `qr_ks_vs_async_whole_k_prefetch` shows much better performance than `qr_ks_vs_async` (which is supposed to be highly efficient) on the same case (execution time `64.50ms` by whole_k_prefetch & `80.20ms` by qr_ks_vs_async) 3. For attention shapes that lead to kM0=128, `qr_ks_vs_async_whole_k_prefetch_trload` shows slightly better performance than `qr_ks_vs_async` on MI350 (execution time `104.50ms` by whole_k_prefetch_trload & `106.50ms` by qr_ks_vs_async). 
And they show on-par performance on MI300. ## Test/Verify 1. Use the ROCM xformers branch `test_whole_k_prefetch_n0loop` to test/verify the qr_ks_vs_whole_k_prefetch pipeline, since this pipeline cannot be used by the ck_tile fmha example so far 2. Use the following command-line for building/testing xformers >```bash > #> git clone -b test_whole_k_prefetch_n0loop https://github.com/ROCm/xformers > #> git submodule update --init --recursive > #> pip install --no-build-isolation -e ./ > #> pytest tests/test_mem_eff_attention.py::test_forward >``` 3. Any script that runs on xformers can be used to evaluate the qr_ks_vs_whole_k_prefetch pipeline. Use the following two environment variables to switch between pipelines > ```bash > #> export FMHA_DISABLE_SPECIAL_TREATMENT=1 #> to disable using FAV3 and qr_ks_vs_async_trload pipeline > #> export FMHA_ENABLE_ASYNC_PIPELINE=1 #> to disable using qr_ks_vs_async pipeline for comparing > ``` ## Discussion	2026-04-24 16:31:59 +00:00
Gino Lu	eca3cb3e0a	sparse_attn: add bm0 dispatch for sparge blockmap compatibility Add bm0 field to fmha_jenga_fwd_traits so callers can specify the preferred Q-tile size. Codegen now emits separate tile configs for bm0=64 (sparge blockmap) and bm0=128 (original), with CppConstraint guards to select the right kernel at runtime. End-to-end test passes for both jenga and vsa paths. Performance is known to be suboptimal at this stage; tile sizes and warp counts for the bm0=64 path have not been tuned. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-24 05:13:51 -04:00
Yi DING	b2ea5fd315	[rocm-libraries] ROCm/rocm-libraries#6701 (commit f9a8d1c) [CK] Fix CI Failures for PR From Forks ## Motivation Fork PRs fail CI when `RUN_AITER_TESTS` or `RUN_FA_TESTS` is enabled. The docker scripts run `git clone -b "$CK__BRANCH" https://github.com/ROCm/rocm-libraries.git`, but a fork's branch doesn't exist upstream: ``` fatal: Remote branch <fork-branch> not found in upstream origin ``` Example: [PR #6529 build #4](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-6529/4/pipeline). ## Technical Details `Jenkinsfile`*: for PRs, use the upstream-visible PR ref instead of the head branch name: ```groovy CURRENT_BRANCH_NAME = env.CHANGE_ID ? "refs/pull/${env.CHANGE_ID}/head" : (env.CHANGE_BRANCH ? env.CHANGE_BRANCH : env.BRANCH_NAME) ``` `Dockerfile.aiter` / `Dockerfile.fa`: `git clone -b <ref>` only accepts branches (`refs/heads/`) and tags (`refs/tags/`), so it can't resolve `refs/pull/N/head`. Switch to `git fetch`, which accepts any refspec (and still works for plain branch names): ```sh mkdir rocm-libraries && cd rocm-libraries git init -q git remote add origin https://github.com/ROCm/rocm-libraries.git git fetch --depth 1 --filter=blob:none origin "$CK__BRANCH" git sparse-checkout init --cone git sparse-checkout set projects/composablekernel git checkout FETCH_HEAD ``` `git checkout FETCH_HEAD` lands in detached HEAD, which breaks the existing `git branch -m "$CK__BRANCH"` (and that name isn't a valid local branch anyway). Decouple the local branch name from the upstream ref: - Replace `git init` + `git branch -m` with `git init -b "$LOCAL_BRANCH"` (requires git ≥ 2.28, satisfied by base images) - `LOCAL_BRANCH="ck-import-${ROCM_LIBRARIES_SHA}"` in the rocm-libraries path; `LOCAL_BRANCH="$CK_*_BRANCH"` in the fallback - Downstream `git clone -b ... 
../ck` uses `$LOCAL_BRANCH` ## Test Plan Manually trigger a build on this PR with `RUN_AITER_TESTS=true` and `RUN_FA_TESTS=true`; both docker images should build end-to-end. ## Test Result [jenkins / rocm-libraries-folder/Composable Kernel / PR-6701 / #3](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-6701/3/pipeline/) ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-24 08:23:24 +00:00
Jeff Huang	fdf4bb7fcc	[rocm-libraries] ROCm/rocm-libraries#6653 (commit 1df887e) [CK_TILE] fix(fmha): support >2GB KV cache in batch prefill via template dispatch (#6653) ## Motivation The CK batch prefill kernel previously failed (silent overflow + page faults) when the KV cache exceeded 2 GB, blocking long-context inference workloads (e.g., 128K+ token contexts with paged KV). Two distinct failure modes were addressed: 1. >4GB SRD overflow (`page_size < kN0`): The SRD `buffer_load_dwordx4` path uses a 32-bit `voffset` register; for small page sizes the rebased SRD spans the full KV pool and the offset wraps past 2 GB, corrupting K/V loads. 2. gfx950 page-table fault (`page_size >= kN0`): On CDNA4 the hardware validates the full SRD `num_records` range against page-table permissions (CDNA3 only checks per-instruction `voffset`). After per-tile SRD rebase, an un-trimmed `num_records` field extends past the live page and faults on freed/protected memory. ## Technical Details Two-mode `tile_scatter_gather` selected by the `kUseGlobalLoad` template parameter: \| Case \| `page_size` \| KV cache size \| Mode \| Load path \| Addressing \| \|---\|---\|---\|---\|---\|---\| \| 1 \| `>= kN0` (large pages) \| any \| SRD (`kUseGlobalLoad=false`) \| `buffer_load_dwordx4` \| 32-bit `voffset`, bounded by per-page rebase \| \| 2 \| `< kN0` (small pages) \| `<= 2 GB` \| SRD (`kUseGlobalLoad=false`) \| `buffer_load_dwordx4` \| 32-bit `voffset`, fits in INT32 byte range \| \| 3 \| `< kN0` (small pages) \| `> 2 GB` \| Global-load (`kUseGlobalLoad=true`) \| `async_load_tile_raw_flat` (K) + `load_tile_flat` (V) \| 64-bit \| Dispatch: the auto-gen API layer (`fmha_batch_prefill.py`) selects the kernel instantiation at launch from `(page_block_size, num_total_pages * batch_stride_k * kElementBytes)`, so the small-page penalty is paid only when correctness requires it. 
gfx950 SRD `num_records` trimming: in the K and V rebase lambdas of `block_fmha_batch_prefill_pipeline_qr_ks_vs_async`, `set_bottom_tensor_view_buffer_size(page_stride_k/v)` is called after each rebase to constrain `num_records` to the live page. Required for CDNA4 page-table validation; harmless on CDNA3. Pipeline sync for the global-load path: - V uses synchronous `load_tile_flat`; K uses `async_load_tile_raw_flat`. - `v_physical_pages_current` is double-buffered so the V flat load doesn't race against the next iteration's K rebase computation. Arch guards: `global_load_lds` intrinsics are gated to `__gfx94__` / `__gfx950__` (CDNA3+). Other architectures hit a `dependent_false` static_assert with a descriptive message. Device-side assertion convention: SRD setters use `__builtin_assume(cond)` (hint-only) rather than `<cassert>`'s `assert()`. The latter introduces an `__assert_fail` call whose register pressure scatters the K-SRD scalar register window across conditional branches, corrupting `buffer_load_dwordx4` on gfx950. ## Test Plan Tested on both MI308 (gfx942) and MI355 (gfx950) via the aiter wrapper test suite. All coverage lives in `op_tests/test_batch_prefill.py`: - Functional matrix (96 cases) — `test_batch_prefill`: `page_size ∈ {1, 16, 1024}` × `kv_layout ∈ {linear, vectorized}` × `dtype ∈ {bf16, fp8 quant variants}` × `causal` × `soft_cap` × `LSE` × `batch_size ∈ {1, 4}` (parametrized to exercise per-sequence SRD rebase across batch boundaries). - >2 GB coverage — `test_batch_prefill_large_kvcache`: extended to allocate a 5 GB+ KV cache pool and exercise both `kUseGlobalLoad=true` (small-page) and `kUseGlobalLoad=false` (large-page rebase) paths. Includes both single-batch and multi-batch (`batch_size=4`) cases to exercise per-sequence SRD rebase across the >2 GB pool. - Numerical reference: PyTorch SDPA, per-batch loop with `atol` / `rtol` from the existing batch prefill test harness. 
## Test Result \| Arch \| `test_batch_prefill` \| `test_batch_prefill_large_kvcache` (>2 GB) \| \|------\|----------------------\|---------------------\| \| MI308 (gfx942) \| All passed \| Passed \| \| MI355 (gfx950) \| All passed \| Passed \| Performance impact (gfx950, hot SRD path): - +2.67% kernel-time on `seqlen=1024 / page_sz=1024 / bf16 / sglang / causal / soft_cap=30`, attributable in full to the two `set_bottom_tensor_view_buffer_size` calls in the K/V rebase lambdas (5-run median, signal/noise ≈ 9×). - This cost is mandatory for gfx950 correctness on >2 GB workloads — removing the setters re-introduces page-faults. - gfx942: 0 regressions in the same range (all configs ≤ +0.97%). ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-23 23:09:25 +00:00
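The three-case dispatch table in the commit above (ROCm/rocm-libraries#6653) can be sketched in a few lines. This is a minimal Python illustration, not the code in `fmha_batch_prefill.py`: the function name `use_global_load`, its parameter names, and the choice of `2**31 - 1` as the exact "2 GB" boundary are assumptions made for the sketch.

```python
# Assumption: the "2 GB" limit in the commit message is the signed 32-bit byte range.
INT32_MAX = 2**31 - 1


def use_global_load(page_block_size: int, kn0: int, num_total_pages: int,
                    batch_stride_k: int, element_bytes: int) -> bool:
    """Return True when the 64-bit global-load path would be selected (case 3)."""
    # Case 1: large pages stay on the SRD path; the per-page rebase bounds voffset.
    if page_block_size >= kn0:
        return False
    # Cases 2 and 3: small pages, so the rebased SRD spans the whole KV pool.
    kv_cache_bytes = num_total_pages * batch_stride_k * element_bytes
    # Case 2 (fits in 32-bit offsets) keeps SRD; case 3 needs 64-bit addressing.
    return kv_cache_bytes > INT32_MAX


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to land in each of the three cases.
    print(use_global_load(1024, 128, 10**6, 4096, 2))    # large pages
    print(use_global_load(16, 128, 1000, 1024, 2))       # small pages, small pool
    print(use_global_load(16, 128, 3_000_000, 1024, 2))  # small pages, large pool
```

Keeping the check at launch time matches the commit's stated goal that the small-page penalty is paid only when correctness requires it.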
Luo Cheng	2fae12cbbb	[rocm-libraries] ROCm/rocm-libraries#6242 (commit f46ac14) [CK] Fix out of bounds modifications caused by negative topk_ids in MoeSortingMultiPhaseKernel_P0_v1 (#6242) ## Motivation Fix random sglang crashes by filtering out negative topk ids. ## Technical Details In sglang expert-parallel mode, an idle batch (batch=0) may be launched; in CUDA graph mode it reuses the batch=1 resources. The topk op sets unused topk ids to -1, so in the idle-batch case all topk ids are -1. In `MoeSortingMultiPhaseKernel_P0_v1`, a negative expert id causes an out-of-bounds write, and sglang may crash at random. Besides the idle-batch case, if the captured batch sizes are discrete, expert ids of -1 can appear through similar logic. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: zovonoir <jialzhu@amd.com>	2026-04-23 22:45:32 +00:00
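The filtering idea in the commit above (ROCm/rocm-libraries#6242) can be illustrated on the host side. This is a minimal Python sketch, not the `MoeSortingMultiPhaseKernel_P0_v1` code: the function `count_tokens_per_expert` and its inputs are hypothetical and only show why ids of -1 must be skipped before they are used as an index.

```python
def count_tokens_per_expert(topk_ids, num_experts):
    """Count how many (token, slot) pairs are routed to each expert.

    topk_ids is a list of per-token lists of expert ids; unused slots hold -1.
    """
    counts = [0] * num_experts
    for token_ids in topk_ids:
        for expert_id in token_ids:
            # Skip padding ids. Without this check a GPU kernel would write
            # outside the counts buffer; in Python, counts[-1] would silently
            # bump the last expert instead.
            if expert_id < 0:
                continue
            counts[expert_id] += 1
    return counts


if __name__ == "__main__":
    # Mixed batch: some slots unused.
    print(count_tokens_per_expert([[0, 2], [-1, -1], [2, -1]], 4))
    # Idle batch: every id is -1, so nothing is counted.
    print(count_tokens_per_expert([[-1, -1]], 4))
```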
Artem Kuzmitckii	281d1bf50b	[rocm-libraries] ROCm/rocm-libraries#6132 (commit e97065d) [CK] Fix divide-by-zero crash for grouped conv kernels (#6132) ## Motivation While running the PyTorch unit tests for conv3d (`test_dtypes_nn_functional_conv3d_cuda`, `test_fake_crossref_backward_amp_nn_functional_conv3d_cuda_float32`), a divide-by-zero crash was found during CK kernel selection. Refs ROCM-20764 ## Technical Details Add an assert for K0PerBlock equal to 0, and cover other potential places related to the k_batch calculation. ## Test Plan Run the MIOpen command extracted from the mentioned test: `MIOpenDriver convfp16 --spatial_dim 3 -I NCDHW -O NCDHW -f NCDHW -n 1 -c 1 -k 1 -g 1 --in_d 4 -H 4 -W 4 --fil_d 4 -y 4 -x 4 --pad_d 0 -p 0 -q 0 --conv_stride_d 2 -u 2 -v 2 --dilation_d 1 -l 1 -j 1 -m conv -F 4 -t 1` ## Test Result Passed ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>	2026-04-23 20:12:40 +00:00
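The kind of guard described in the commit above (ROCm/rocm-libraries#6132) can be sketched as follows. This is a hedged Python illustration, not CK code: the helper `num_k_loops`, its parameters, and the ceil-division formula are hypothetical. Only the idea of rejecting `K0PerBlock == 0` before a division that feeds the `k_batch` calculation comes from the commit message.

```python
def num_k_loops(k_total: int, k0_per_block: int, k_batch: int = 1) -> int:
    """Hypothetical helper: ceil(k_total / (k0_per_block * k_batch)).

    Validates the divisors first so that a degenerate tile configuration is
    reported as an error instead of crashing with a division by zero.
    """
    if k0_per_block <= 0:
        raise ValueError("K0PerBlock must be positive")
    if k_batch <= 0:
        raise ValueError("k_batch must be positive")
    step = k0_per_block * k_batch
    return (k_total + step - 1) // step


if __name__ == "__main__":
    print(num_k_loops(64, 8, 2))
    try:
        num_k_loops(64, 0)
    except ValueError as err:
        # The zero divisor is caught by the guard, not by the division itself.
        print("rejected:", err)
```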
KateJu	793a59736a	[rocm-libraries] ROCm/rocm-libraries#6656 (commit 1c958f8) Fix per-layer conv2d int8 CPU verification reference path (#6656) case example_conv2d_fwd_xdl_perlayer_quantization_int8.exe 1 0 ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-23 14:09:58 +00:00
KateJu	34c7c3bbf2	[rocm-libraries] ROCm/rocm-libraries#6655 (commit 677b38d) Add missing lds sync ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-23 14:06:27 +00:00
jakpiase	ad412c26f3	[rocm-libraries] ROCm/rocm-libraries#6624 (commit 47d0162) [CK_TILE] Grouped Convolution Backward Data Direct Load (#6624) ## Proposed changes Add Grouped Convolution Backward Data with Direct Load into DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffleV3 device implementation. This enables direct global memory loading (bypassing LDS) for the backward data convolution path on gfx950, following the same pattern used in both backward weight and forward convolution. Direct load convolution backward data improves performance by avoiding LDS round-trips for certain configurations on gfx950, which supports a wider range of instructions. Currently correctness is checked only at usage point, but should be extended to a standalone UT in the future.	2026-04-23 09:17:50 +00:00

3298 Commits