Commit Graph

3298 Commits

Author SHA1 Message Date
Gino Lu
fb75da2467 sparse_attn: wire -mask and -attention_sink (block-map prune + attn mask) 2026-05-19 23:22:00 -04:00
Gino Lu
b3ea819ff7 sparse_attn: annotate PV-skip chart with speedup vs dense baseline 2026-05-19 22:42:49 -04:00
Gino Lu
9e3f8838de sparse_attn: drop stale FMHA-vs-sparge perf section from README 2026-05-19 22:37:55 -04:00
Gino Lu
d939c3b4fc sparse_attn: split-launch dispatch + 3-mode PV-skip
- Per-head pv_threshold via head_remap LUT (CLI: -pv_threshold_per_head);
  sentinel 1e30 routes to kEnablePVSkip=false bucket
- kEnablePVSkip bool → PVSkipMode enum {kNone, kPerWarp, kPerBlock};
  new kPerBlock matches upstream sm80 (LDS vote, V loads unconditional).
  CLI: -pv_mode={none,warp,block}, default warp
- README: PV-skip modes section + MI300X 3-curve sparsity chart

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-05-19 21:45:23 -04:00
Gino Lu
304c1f9244 Merge remote-tracking branch 'origin/develop' into ginolu/sparge_attention 2026-05-19 21:34:32 -04:00
Illia Silin
37950ea4eb [rocm-libraries] ROCm/rocm-libraries#7547 (commit 7e032ad)
[CK] fix daily builds for pytorch

## Motivation

This will restore the daily builds that test whether the latest pytorch
code can build with the latest CK code (pulled from the standalone CK
repo).

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-19 14:14:32 +00:00
Aaryaman Vasishta
457f153b69 [rocm-libraries] ROCm/rocm-libraries#7016 (commit 2b73c00)
[CK] Fix RDNA3 FMHA tile-load paths

## Summary

Fix CK tile FMHA paths needed for RDNA3/RDNA4 targets.

## Details

This PR addresses RDNA-specific issues hit while enabling xFormers CK
FMHA on gfx11/gfx12:

- On RDNA3, update FMHA P tile handling so the layout consumed by the
second GEMM matches the WMMA path.

## Testing

Validated downstream with xFormers CK/FMHA on gfx1201/gfx1151.

```text
pytest --import-mode=importlib -q \
  tests/test_mem_eff_attention.py::test_forward \
  tests/test_mem_eff_attention.py::test_backward \
  tests/test_mem_eff_attention.py::test_dropout_ck

3844 passed, 5244 skipped, 26 warnings
2026-05-19 13:42:43 +00:00
Po Yen Chen
424dfec6e4 [rocm-libraries] ROCm/rocm-libraries#7530 (commit 378e049)
[CK] Fix FMHA sink dispatch when init_sink_value is set
 (#7530)

## Summary
- Fix `traits.has_sink` in `fmha_fwd_runner.hpp` to also check
`init_sink_value != 0`, so the GPU kernel dispatches with sink support
when `-init_sink=1` is passed.
- Gate `run_sink_mask_tests` (StreamLLM) and `run_sink_init_tests`
(GPT-OSS) behind opt-in flags `-m` and `-g` in `smoke_test_fwd.sh`.
These tests require sink=true kernel instances which are excluded by the
`BUILD_TESTING` CMake filter (`*_nsink*`), causing unconditional "not
supported yet" failures (48 tests in CI). The opt-in flag approach was
borrowed from PR #6057.

## Why gate tests instead of compiling sink=true kernels?

The `BUILD_TESTING` filter in `CMakeLists.txt` uses `*_nsink*` glob
patterns for the `fwd` and `fwd_splitkv` APIs, excluding sink=true
kernel instances from compilation. We chose opt-in flags over widening
the filter because:

- **Compile time**: Enabling sink=true kernels doubles the kernel
variants for `fwd` and `fwd_splitkv` APIs. The filter exists
specifically to reduce CI build times.
- **Incremental enablement**: Sink support (StreamLLM / GPT-OSS) is
still maturing. Gating lets teams opt in explicitly (`smoke_test_fwd.sh
-g`) while keeping the default CI path fast.
- **Precedent**: splitkv (`-s`) and appendkv (`-a`) tests already follow
this opt-in pattern.

## Test plan
- [ ] Run `smoke_test_fwd.sh -g` with sink=true kernels compiled and
verify sink-enabled kernels are dispatched
- [ ] Verify `smoke_test_fwd.sh` still passes without `-m` / `-g` flags
- [ ] Confirm CI no longer fails on sink tests (they are now opt-in)
2026-05-18 16:10:30 +00:00
Gino Lu
0f8b58ac88 sparse_attn: R25 Step 1 A1 — per-warp PV-skip (paper Algorithm 1) + V0 instantiation
Preserve the R25 Step 1 "A1 / redesign D" state before redesigning toward "B"
(per-CTA PV-skip matching upstream shipped reference). This snapshot lets us
restore A1 if the B redesign fails.

A1 redesign D pipeline (per-warp, arithmetic-only PV-skip, wrapped in
`if constexpr (kEnablePVSkip)`):
  - include/ck_tile/ops/sparse_attn/pipeline/block_fmha_pipeline_qr_ks_vs_async_sparge.hpp
  - include/ck_tile/ops/sparse_attn/kernel/fmha_fwd_sparge_kernel.hpp

V0 instantiation wiring (per gino_tmp/R25/programmer/v0_instance/REPORT.md):
  - example/ck_tile/50_sparse_attn/codegen/ops/fmha_fwd_sparge.py
  - example/ck_tile/50_sparse_attn/fmha_fwd_trek.hpp
  - example/ck_tile/50_sparse_attn/sparge_blockmap_trek.hpp
  - example/ck_tile/50_sparse_attn/sparge_blockmap_inst.cpp
  - example/ck_tile/50_sparse_attn/codegen/cpp_symbol_map.py
  - example/ck_tile/50_sparse_attn/CMakeLists.txt
  - example/ck_tile/01_fmha/CMakeLists.txt
  - example/ck_tile/50_sparse_attn/test_sparge.cpp (-pv_skip_compile=0|1 CLI)

This commit excludes all *_REVIEW.{hpp,cpp} mirror files (left untracked) and
all build artefacts. _vsa.hpp / _jenga.hpp are not modified.

Tag: R25-step1-A1-paper-aligned points at this commit.
2026-05-18 06:13:38 -04:00
Vidyasagar Ananthan
86591de476 [rocm-libraries] ROCm/rocm-libraries#5260 (commit a1834d2)
[CK] [CK_Tile] Add FMHA scaffolding to CK kernel dispatcher
 (#5260)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

The CK Tile dispatcher currently supports GEMM and Grouped Convolution
but has no support for Fused Multi-Head Attention (FMHA). The
example/ck_tile/01_fmha folder contains a comprehensive FMHA
implementation with forward, backward, split-KV, paged-KV, append-KV,
and batch-prefill kernels across multiple GPU architectures — but there
is no unified dispatch layer for it. This PR ports the FMHA stack into
the dispatcher, following the same architectural patterns established by
GEMM and Grouped Convolution, enabling runtime kernel selection, JIT
compilation from Python, and a declarative C++ example flow. Autotuning
heuristics to follow.

## Technical Details

This PR adds FMHA scaffolding to the CK dispatcher framework, mirroring
GEMM's layered architecture. Seven new C++ runtime headers provide type
definitions (coexisting with upstream headers via __has_include,
requiring zero modifications to example/ck_tile/01_fmha/), a problem
builder with 18+ setters, Signature + Algorithm kernel key matching, a
virtual kernel instance, a DECL_FMHA_KERNEL_SET macro with wildcard
support and named tile/wave/warp setters, arch-aware registry with JSON
export, and a dispatcher with seqtune-aware selection, configurable
timing, and multi-stage execution plans for split-KV (two-stage) and
backward (three-stage). The codegen pipeline is driven by a
fmha_arch_specs.json capturing per-arch tile tables and pipeline
constraints for five architectures (gfx90a/942/950/1100/1201), migrated
from hardcoded logic in 01_fmha/codegen/, with supporting modules for
C++ symbol mappings, validation rules, and named receipt profiles
(ck_default, flash, pytorch, aiter, fp32, fp8). Python integration
(fmha_utils.py) mirrors the C++ layer with JIT compilation, parallel
multi-kernel builds, HIP memory management via ctypes, tolerance-based
validation, and a NumPy CPU reference with GQA support. Twenty-seven C++
and thirty-two Python examples cover the full feature surface — forward,
split-KV, masks, bias, dropout, GQA, backward, append-KV, batch prefill,
fp8, logits soft cap, sink tokens, and parameter sweeps — all
JIT-compiled on the fly.

## Test Plan

Seven test files cover the runtime types, codegen, and end-to-end
correctness. C++ unit tests validate the problem builder, dispatcher
planning (single-stage for forward/paged-KV/append-KV; multi-stage for
split-KV and backward), registry operations, and the kernel-set
declaration macro. Python unit tests verify codegen emission, profile
filtering, and 15 validation rules for masks, hdim constraints, and
pipeline requirements. GPU execution validation in 01_basic_fmha
--validate reports zero errors across 65,536 elements with max absolute
error of 7.29e-05. A gold-standard parity suite (test_fmha_parity.py)
runs 14 configurations through both the upstream tile_example_fmha_fwd
and the dispatcher, comparing exit codes to confirm behavioral parity —
all 14 match.

## Test Result

The C++ smoke test builds and passes all 9 compiled examples, and a
Python JIT sweep (29_sweep_seqlen.py) passes 7/7 configurations reaching
up to 375 TFLOPS at seqlen 2048.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-17 07:30:33 +00:00
Gino Lu
840b8a37d9 test(sparse_attn): CPU-ref cross-check + BLKQ cite
Wire SpargeAttn CPU reference into test_sparge: build the block_map on host via
sparge::build_block_map_meansim and cross-check against the GPU-produced map;
self-check the VSA delta-LUT (valid count + reachable kb indices); split PASS/FAIL
into separate block_map / LUT / attention-output lines for clearer diagnosis.

Set sparge_tool::SpargeParams::BLKQ default to 64 to match SpargeAttn SM90
convention (cite upstream qk_int_sv_f8_cuda_sm90.cu:143-144); tighten bf16
tolerance back to the dense FMHA baseline (4e-2 atol, 1e-2 rtol).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-05-17 02:35:51 -04:00
Gino Lu
879d50836e cleanup(sparse_attn): R-tag rename + clang-format sweep
Strip internal R-tag / phase labels (R20, R21A/B, Round 8/13f, Track F, B2.v3,
Phase 1/2/3) from comments — replace with descriptive names so future readers
don't need the change-log. Reflow long signature in fmha_fwd_trek.hpp.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-05-17 02:35:07 -04:00
Gino Lu
7103eacc99 refactor(sparse_attn): caller-owned workspace + dtype-aware sizing
Replace process-lifetime lazy hipMalloc K-stats workspace with a caller-owned
buffer; expose sparge_blockmap_get_workspace_size() / compute_workspace_layout()
host helpers. Split the combined sparge_blockmap_fwd into stage launchers
(sparge_kstats_fwd_oneshot + sparge_blockmap_only_fwd_oneshot) so the chained
launch is timed end-to-end.

Make pooled_k storage dtype follow KDataType (fp16/bf16) instead of fp32 to halve
workspace footprint and match dense-FMHA precision. Tighten per-head superparam
pointers to required (non-null) and assert N_k <= 256 in jenga MakeKargs to
document the 256-bool LDS staging cap. Drop the obsolete VSA extra-LDS staging.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-05-17 02:34:23 -04:00
Gino Lu
668e107282 fix(sparse_attn): backport PR #4742 LDS s_barrier
Add s_barrier after sched_barrier when K-tail and V share LDS buffer,
mirroring upstream PR #4742. Applies to both async_vsa and async_jenga pipelines.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-05-17 02:30:48 -04:00
Jobbins
61b019f2a2 [rocm-libraries] ROCm/rocm-libraries#6961 (commit 47e8768)
[CK] print hostname and $NODE_NAME to find inconsistencies
 (#6961)

## Motivation

We suspect that the check for amdgpu: `cat /sys/module/amdgpu/version`
sometimes gets ran on the Jenkins controller instead of the node. This
adds the `hostname` command to compare to the $NODE_NAME variable.

## Technical Details

Updated Jenkinsfile to include the `hostname` command.

## Test Plan

## Test Result

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-15 21:50:09 +00:00
John Shumway
3e110e1718 [rocm-libraries] ROCm/rocm-libraries#7114 (commit ecef372)
[CK] Add rocm_ck foundation types: DataType, Layout, Args,
 Ops (#7114)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

- Add the vocabulary types that all rocm_ck schema headers build on
- 9 new headers under `include/rocm_ck/`, 6 unit test files
- Pure C++20, host-only — no CK Tile dependencies

**Headers:**

| Header | Purpose |
|--------|---------|
| `index_t.hpp` | `index_t`, `long_index_t` (matches ck_tile) |
| `gpu_target.hpp` | `GpuTarget` enum (ISA targets) |
| `datatype.hpp` | `DataType` enum (17 variants) |
| `layout.hpp` | `Layout` enum (Row, Col, Auto) + stride helpers |
| `fixed_string.hpp` | `FixedString<N>` — structural string for NTTPs |
| `args.hpp` | Generic kernel argument buffer (ABI) |
| `ops.hpp` | Operator structs (`GemmOp`, `AddOp`, ...) + `Op` variant |
| `physical_tensor.hpp` | `PhysicalTensor` — maps names to Args slots |
| `resolved_tensor.hpp` | `ResolvedTensor` — output of
`Signature::resolve()` |

**Stack**: This is PR 1 of 3 porting the rocm_ck constexpr schema from
experimental to production, #7143.
1. **This PR** — Foundation types (vocabulary)
2. Schema engine — `Signature`, `resolve()`, `ArchProperties`
3. Spec factories — `GemmSpec`, `ElementwiseSpec`, `makeSpec()`

## Test plan

- [ ] `ninja build-smoke-rocm-ck` builds all tests
- [ ] `ctest -L ROCM_CK_SMOKE --output-on-failure` — 6 unit tests pass
(86 test cases)
- [ ] Default CK build (`CK_ENABLE_ROCM_CK=OFF`) unaffected

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-05-15 19:22:44 +00:00
Illia Silin
187ef8ac94 [rocm-libraries] ROCm/rocm-libraries#7471 (commit 13b9eec)
[CK] increase timeout limit for fmha_fwd tests to avoid CI
 failure on gfx11 (#7471)

## Motivation

This should prevent fmha_fwd tests from timing out on one of the slower
gfx11 CI nodes and generating false CI failures.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-15 16:42:53 +00:00
peterjunpark
fcece1e838 [rocm-libraries] ROCm/rocm-libraries#7380 (commit 50b369d)
docs(composablekernel): update install pages for 7.13

## Motivation

Update instructions to install prebuilt packages. Generally, point to
ROCm Core SDK install instructions. List a more granular package if
useful to users.

Preview:
https://rocm.docs.amd.com/projects/composable_kernel/en/users-peterjunpark-ck-7.13-install/install/Composable-Kernel-install.html

>[!NOTE]
>Some links appear as plain text in the preview b/c the target page
isn't publicly accessible yet.

Related to:
- [x] composablekernel https://github.com/ROCm/rocm-libraries/pull/7380
  - [x] hipblas https://github.com/ROCm/rocm-libraries/pull/7378
  - [x] hipblaslt https://github.com/ROCm/rocm-libraries/pull/7379
  - [x] hipcub https://github.com/ROCm/rocm-libraries/pull/7377
  - [x] hipdnn https://github.com/ROCm/rocm-libraries/pull/7376
  - [x] hipfft https://github.com/ROCm/rocm-libraries/pull/7375
  - [x] hiprand https://github.com/ROCm/rocm-libraries/pull/7374
  - [x] hipsolver https://github.com/ROCm/rocm-libraries/pull/7371
  - [x] hipsparse https://github.com/ROCm/rocm-libraries/pull/7373
  - [x] hipsparselt https://github.com/ROCm/rocm-libraries/pull/7372
  - [x] miopen https://github.com/ROCm/rocm-libraries/pull/7370
  - [x] rocblas https://github.com/ROCm/rocm-libraries/pull/7369
  - [x] rocfft https://github.com/ROCm/rocm-libraries/pull/7368
  - [x] rocprim https://github.com/ROCm/rocm-libraries/pull/7367
  - [x] rocrand https://github.com/ROCm/rocm-libraries/pull/7366
  - [x] rocsolver https://github.com/ROCm/rocm-libraries/pull/7364
  - [x] rocsparse https://github.com/ROCm/rocm-libraries/pull/7365
  - [x] rocthrust https://github.com/ROCm/rocm-libraries/pull/7363
  - [x] rocwmma https://github.com/ROCm/rocm-libraries/pull/7362

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## JIRA ID

<!-- If applicable, mention the JIRA ID resolved by this PR (Example:
Resolves SWDEV-12345). -->
<!-- Do not post any JIRA links here. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-15 14:58:51 +00:00
Bartłomiej Kocot
945849b0f5 [rocm-libraries] ROCm/rocm-libraries#6838 (commit ff7a665)
[CK_TILE] Add depthwise conv2d forward kernel (FP16/FP32)
 (#6838)

## Motivation

CK currently has no kernel optimized for depthwise convolution
(G=C_in=C_out, C=K=1 per group) and existing generic paths perform
poorly for this workload. This PR adds a dedicated depthwise conv
forward kernel in CK Tile.

## Technical Details

Adds a dedicated depthwise conv2d forward op to CK Tile that performs
direct convolution rather than falling back to the generic GEMM path.
The kernel is templatized by filter size, stride, and data type, and
compiled into ~60 instances covering common configurations (kernel
3/5/7/9, stride 1/2, FP16/FP32). Supports both CDNA (gfx942/gfx950) and
RDNA (gfx1100/gfx1200) architectures.

## Test Plan

- [x] Correctness and performance validated on gfx942, gfx950, and
gfx1100, with ckProfiler `grouped_conv_fwd` as baseline.
- [ ] MI300A (gfx942) and gfx1200 validation.

## Submission Checklist

- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-1137
2026-05-15 13:48:51 +00:00
Yaswanth Raparti
fe2e29fa68 [rocm-libraries] ROCm/rocm-libraries#7289 (commit e3fb4ee)
[CK] Fix smart build false positives from merged commits
 (#7289)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation
Current smart-build infrastructure triggers full build for almost every
PR which is draining our CI infrastructure. Need to update the test
selection logic based on diffs from the current workspace instead of
entire repo.

## Technical Details
Use three-dot syntax and scope BUILD_INFRA_PATTERN to composablekernel.

Changes:
- Switch from two-dot (..) to three-dot (...) in git diff
  - Three-dot shows only PR-specific changes
  - Excludes commits merged from develop (prevents false positives)
- Scope BUILD_INFRA_PATTERN to projects/composablekernel/ paths only
  - Avoids triggering on other projects (hipblas, hipdnn, etc.)
  - Only composablekernel build infra changes trigger full build
- Update both ci_safety_check.sh and validate_pr.sh

## Test Plan
Test with PR 7112 and 7223

## Test Result
Impact:
- PR 7112: Was 620 files (false positive) → Now 6 files (correct)
- PR 7223: Was full build (false positive) → Now selective build
(correct)

## Submission Checklist

- [ x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-14 19:32:32 +00:00
John Shumway
6cd06382b3 [rocm-libraries] ROCm/rocm-libraries#7090 (commit 316fded)
[CK] Add rocm_ck directory structure with feature flag
 (#7090)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Adds initial rocm_ck directory structure, #7119.

- Establishes production `rocm_ck/` directory at
`composablekernel/rocm_ck/`, peer to `tile_engine/` and `dispatcher/`
- Adds `CK_ENABLE_ROCM_CK` option (default OFF) as a CK-internal feature
flag — no superbuild or TheRock changes needed
- Creates `rocm_ck` INTERFACE library, `ck_tile_headers` target, GTest
integration with builder-style convenience targets (`smoke-rocm-ck`,
`check-rocm-ck`)
- Adds Jenkins `RUN_ROCM_CK_TESTS` parameter for CI, following the
`RUN_BUILDER_TESTS` pattern
- README explains the constexpr schema model: host-device separation via
constexpr data rather than template parameters, enabling multi-arch
distribution through kpack archives

## Test plan

- [x] `cmake -DCK_ENABLE_ROCM_CK=ON` configures without errors
- [x] `ninja check-rocm-ck` passes (4 host-only index type tests)
- [x] Default build (`CK_ENABLE_ROCM_CK=OFF`) is unaffected — no rocm_ck
targets present
- [x] Jenkins `RUN_ROCM_CK_TESTS=true` enables the flag and runs
`check-rocm-ck`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-05-14 18:52:38 +00:00
Meekail Zain
d931e8703d [rocm-libraries] ROCm/rocm-libraries#6867 (commit 3cb0219)
Added custom FMHA codegen receipt for TransformerEngine
 (#6867)

## Motivation

TE uses AITER to build static MHA libraries, which ultimately rely on CK
kernels. We use the `600` receipt which generates more kernels than TE
truly needs. This bespoke receipt allows us to minimize the kernel
count, compile time, and memory footprint of our MHA library.

## Technical Details

Extended the receipt mechanism to include a custom `700` receipt for
TE's needs

## Test Plan

Test by building TE using the same receipt profile

## Test Result

Build validated in TE using a custom feature branches of AITER/CK to
temporarily apply the patch

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-14 14:34:03 +00:00
Yi DING
83566edb0f [rocm-libraries] ROCm/rocm-libraries#7331 (commit 5692db0)
[CK_TILE] Add async workspace prepare to FMHA BWD launcher
 (#7331)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

`aiter::mha_bwd` in group mode currently issues two synchronous
`hipMemcpy` D2H copies to read `seqstart_q/k` for launcher construction.
These sync copies block the host (~10–30 µs each) and implicitly
synchronize the device by draining the stream, breaking CPU/GPU overlap
on hot training paths.

This PR adds a fully stream-async workspace preparation path on the FMHA
BWD launcher so callers can pre-allocate the device workspace from
upper-bound shapes and stage seqstart-dependent metadata via
D2H/host-pack/H2D entirely on the user's stream.

## Technical Details

- `FmhaBwdWorkspaceManager::GetWorkspaceDeviceSizeUpperBound`
(`include/ck_tile/ops/fmha/kernel/fmha_bwd_kernel.hpp`): computes the
worst-case device dq_acc size from `(max_batch, hdim_q, nhead_q,
max_seqlen_q, max_seqlen_k)` without dereferencing any seqstart array.
Mirrors `PrepareWorkspaceHost`'s return value with worst-case bounds.
- `fmha_bwd_launcher::prepare_workspace_async`
(`example/ck_tile/01_fmha/fmha_bwd.hpp`): on the caller's stream, in
order:
  1. `hipMemsetAsync` of the dq_acc region (when `NeedsZeroDqAcc()`)
2. group mode: `hipMemcpyAsync` D2H of `seqstart_q/k` into a pinned host
staging buffer
3. `hipLaunchHostFunc` runs `PrepareWorkspaceHost` on the pinned buffer
  4. `hipMemcpyAsync` H2D of the packed metadata into `device_ws_ptr`

The pinned staging buffer is held via `std::shared_ptr<void>` returned
by a caller-provided `pinned_host_alloc` callback. Lifetime is extended
past stream completion by a tail `hipLaunchHostFunc` scheduled in the
launcher's destructor.

- `ck_tile::pinned_host_releaser`
(`include/ck_tile/host/pinned_host_releaser.hpp`): worker-thread utility
for callers using bare `hipHostMalloc`. Defers `hipHostFree` off the HIP
driver callback thread, which holds runtime locks and would deadlock
against concurrent main-thread `hipFree`. PyTorch's
`CachingHostAllocator` does not need this.

- Example runner (`example/ck_tile/01_fmha/fmha_bwd_runner.hpp`):
switched to the async path.

## Test Plan

- `tile_example_fmha_bwd` (gfx950, dev preset `-Werror -Weverything`):
  - batch + nondet / batch + det / group + nondet / group + det
- group + det 4-batch varlen (`-b=4 -h=8 -s=4096,3072,2048,1024 -d=128`)
- FA (`flash-attention`) integration on ROCm 7.1.1 + PyTorch 2.9.1:
  - `tests/test_flash_attn_ck.py::test_flash_attn_varlen_deterministic`
  - `tests/test_flash_attn_ck.py::test_flash_attn_bwd_varlen_seqq_zero`

## Test Result

- All CK runner cases `valid:y`.
- FA pytest: **1952 passed in 44.82s**.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-14 13:34:32 +00:00
Copilot
4d852e80fb [rocm-libraries] ROCm/rocm-libraries#6983 (commit f4e9a84)
Remove batch_prefill from FMHA_FWD_KNOWN_APIS

Remove `batch_prefill` from the `FMHA_FWD_KNOWN_APIS` list in
`projects/composablekernel/example/ck_tile/01_fmha/CMakeLists.txt`.

**Change:**
```cmake
# Before
set(FMHA_FWD_KNOWN_APIS "fwd;fwd_splitkv;fwd_appendkv;pagedkv_prefill;batch_prefill")

# After
set(FMHA_FWD_KNOWN_APIS "fwd;fwd_splitkv;fwd_appendkv;pagedkv_prefill")
```

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: asleepzzz <4926646+asleepzzz@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-05-14 12:42:14 +00:00
Qianfeng
acf3d65966 [rocm-libraries] ROCm/rocm-libraries#7256 (commit 1fc20eb)
=?UTF-8?q?Skip=20numeric=20drop-out=20when=20PComputeWind?=
 =?UTF-8?q?ow=20is=20a=20null=5Ftile=5Fwindow=20in=20Bl=E2=80=A6=20(#7256)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The BlockDropout implementation already provides very complete logic for
generating random numbers and executing dropout for the P tensor after
first attention Gemm with capability to support both Warp-Gemm 32x32 and
16x16 as well as to run on both wave32 and wave64 arch.

But in some situation, we only need the block-layer process to generate
random numbers, rather than simultaneously execute dropout in real-time
on the vgpr tile. For example, xformers'
`test_mem_eff_attention.py::test_dropout_ck` requires the host reference
implementation of `attention forward with dropout` to use the same
random numbers to compare & verify the device side implementation of
`attention forward with dropout`, so a standalone kernel to generate
random numbers only is required.

This PR will enable xformers's random_val generating kernel (in file
`ck_tiled_rand_uniform_kernel.h`) to depend on BlockDropout's `Run()`
operator completely to generate random numbers for a `[MPerBlock,
NPerBlock]` tile during the tile iteration, no need to replicate the
logic of BlockDropout in the xformers kernel
2026-05-13 09:42:28 +00:00
Linjun-AMD
5c7b7ec3f1 [rocm-libraries] ROCm/rocm-libraries#7272 (commit d02f3c0)
[ck_tile][fmha_bwd] Fix sink_host OOB in group mode reference
 runner (#7272)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

In `fmha_bwd_runner.hpp`, the `sink_host` `HostTensor` is allocated with
first
dimension `shape_batch` (= 1 in group mode), but the reference forward
loop
accesses `sink_host(wb, i_h)` with `wb ∈ [0, batch-1]`. For any `wb >=
1` this
is an out-of-bounds heap read, silently corrupting the reference forward
math
chain (`lse_host`, `o_host`) and turning the bwd-side `d_sink_head_acc`
  reference into non-deterministic garbage.

`HostTensor::operator()` does not bounds check, so the OOB is not caught
at
runtime. This manifests as intermittent `tile_example_fmha_bwd` failures
(25–67% fail rate) when `-sink_grad=1` is combined with `-mode=1` (group
mode),
  with bit-exact but spurious `max_err` values like 4.27 / 14.6.

  ## Fix

One-line: allocate `sink_host` with `batch` (the real per-batch dim)
instead of
  `shape_batch`, mirroring how `sink_host` is accessed by the loop.

  ```diff
  -    sink_grad ? std::array<ck_tile::index_t, 2>{shape_batch, nhead}
  +    sink_grad ? std::array<ck_tile::index_t, 2>{batch, nhead}

  Repro

  tile_example_fmha_bwd -b=2 -h=2 -s=516 -s_k=253 -prec=bf16 -d=72 \
    -bias=n -dbias=0 -p_drop=0 -iperm=1 -operm=1 -deterministic=0 \
    -v=3 -mode=1 -kname=1 -sink_grad=1

  Verification

  - 0/30 fail on the repro config after fix
  - Baselines (before fix):
    - sink=1, mask=n: 25% fail rate (p ≈ 1.8e-4)
    - sink=1, mask=t: 67% fail rate (p ≈ 6e-15)

  Attribution

Shape bug introduced together with sink_grad in #5504. Unrelated to
#6914
  (which is a fwd-only fix on a different code path)
```

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-13 08:49:13 +00:00
Yaswanth Raparti
6989cf800c [rocm-libraries] ROCm/rocm-libraries#6327 (commit 1e7a12e)
[CK][CK TILE] Dispatcher kernel selection heuristic for
 grouped conv (#6327)

## Motivation
The ML heuristic in dispatcher does not support grouped-conv operator
yet. In this PR, the support for fwd, bdw-data, and bwd-weight
grouped-conv kernels have been added. A tile_engine utility has also
been added to compile and run any selected kernel configuration through
dispatcher infrastructure.

## Technical Details

1. Tile engine utility is added to benchmark each shape with all the
possible kernel+tile_size combinations here -
[https://github.com/ROCm/rocm-libraries/blob/users/yraparti/ck/dispatcher-grouped-conv-heuristics/projects/composablekernel/tile_engine/ops/grouped_conv/grouped_conv_full_benchmark.py](url)
2. New LGBM regressor models for grouped conv are added to models
directory. We have 3 separate models for fwd, bwd-data, and bwd-weights
[https://github.com/ROCm/rocm-libraries/tree/users/yraparti/ck/dispatcher-grouped-conv-heuristics/projects/composablekernel/dispatcher/heuristics/models](url)
3. Implemented lazy GPU initialization (dispatcher/python)
- **Issue**: ProcessPoolExecutor fork() + GPU context caused memory
access faults
- **Solution**: Mirror FMHA pattern - defer GPU initialization until
first run()
  - **Changes**:
- setup_multiple_grouped_conv_dispatchers() returns List[Path], not
loaded libs
    - GpuGroupedConvRunner.__init__() no longer calls ctypes.CDLL
    - Added _ensure_initialized() method for lazy GPU loading
    - GPU context created only on first run() call
  - **Benefit**: Parallel compilation now works without GPU conflicts
4. Addressed few miscellaneous issues such as:
  - Fixed BF16->FP16 naming bug in the dispatcher wrapper
- Added new tile sizes, and comp_v5 pipeline to the arch spec to expand
the kernel selection
- Added automatic padding support for unsupported shapes in dispatcher
runner
- Created a single source of truth between tile_engine and dispatcher
about the architecture and tile_size details
- Build a validation scripts to compare oracle_best vs ml_heuristic
comparison

## Test Plan

1. Validated fwd, bwd-data, and bwd-weight kernels with both known and
unseen data sets with up to 300 problems.
2. Ensured that test cases are added in both dispatcher and tile_engine
to validate the heuristic.

## Test Result
Results on Unseen shapes validated on gfx950
#### Forward Pass Model
- **Training Data**: 48,845 measurements across 1,372 unique problem
shapes
- **Validation Set**: 300 unseen problems from model crawler
- **Validation Performance** (vs. oracle):
  - Mean Efficiency: **93.05%**
  - Median Efficiency: **96.8%**
  - P10 Efficiency: **79.9%**

#### Backward Data Gradient (bwd_data) Model
- **Training Data**: 18,773 measurements across 891 unique problem
shapes
- **Validation Set**: 300 unseen problems from model crawler
- **Validation Performance** (vs. oracle):
  - Mean Efficiency: **93.8%**
  - Median Efficiency: **96.5%**
  - P10 Efficiency: **82.9%**

#### Backward Weight Gradient (bwd_weight) Model
- **Training Data**: 34,900 measurements across 1,508 unique problem
shapes
- **Validation Set**: 300 unseen problems from model crawler
- **Validation Performance** (vs. oracle):
  - Mean Efficiency: **96.1%**
  - Median Efficiency: **99.2%**
  - P10 Efficiency: **89.4%**

## Submission Checklist

- [ x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-08 20:48:42 +00:00
Illia Silin
b05040b919 [rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f)
[CK] Fix latest batch of staging compiler warnings

## Motivation

Suppress the new batch of clang lifetimebound and invalidation warnings
with the latest staging compiler.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-08 14:15:31 +00:00
Yi DING
41064d8684 [rocm-libraries] ROCm/rocm-libraries#7141 (commit 37e40c3)
[CK_TILE] Fix typo in fmha_fwd_kernel K-dram unmerge tuple
 sizes (#7141)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

The qr_async_trload K-dram lambda's `else (XorLengthFold == 1)` branch
in `fmha_fwd_kernel.hpp` writes the outer-tile dim of its 3-tuple
unmerge/xor/merge as

```cpp
number<FmhaPipeline::kQKHeaddim / kDramTileK / FmhaPipeline::kAlignmentK>{}
```

which divides one extra time. For every fp16/bf16 hdim=128 configuration
the outer length collapses to **0**, e.g. `128 / 128 / 8 == 0`. The
3-tuple product no longer equals `kQKHeaddim`, so unmerge → xor → merge
stops round-tripping the head dimension.

This bug was masked by the async-load path: it only walks the descriptor
via stride and silently absorbs a length=0 outer dim. Any consumer that
actually traverses the descriptor (e.g. the TDM path on gfx1250)
immediately faults on the resulting `tuple<int, constant<0>>`.

The fix drops the extra `/ kAlignmentK` in all three call sites in the
same lambda so the outer dim becomes `kQKHeaddim / kDramTileK` and the
product is restored to `kQKHeaddim`. Strides are unaffected, so the
async path is bit-identical.

| Config (fp16/bf16) | hdim | kDramTileK | kAlignmentK | a (typo) | a
(fixed) | product (typo) | product (fixed) |
|---|---|---|---|---|---|---|---|
| hdim128, kKLoadOnce  | 128 | 128 | 8 | 0 | 1 | **0** | **128** |
| hdim128, kK0=32      | 128 |  32 | 8 | 0 | 4 | **0** | **128** |
| hdim64,  kKLoadOnce  |  64 |  64 | 8 | 0 | 1 | **0** | **64**  |
| hdim256, kK0=32      | 256 |  32 | 8 | 1 | 8 | **32** | **256** |

Bug introduced in 2cc0af6a815a (PR #2888 \"[CK_TILE] FMHA FWD bug
fix\"), where the original 2-tuple unmerge was generalized to a 3-tuple
and the typo slipped in.

## Test plan

- [x] Built `test_ck_tile_fmha_fwd` (umbrella, 5 gtest binaries) on
gfx950 native at develop b3bdc63a509 with `dev-gfx950` preset (clang 22,
ROCm 7.2.2). Compiles cleanly with `-Werror -Weverything`.
- [x] Ran `ctest -R test_ck_tile_fmha_fwd` on gfx950 native, baseline vs
patched: identical pass/fail (3 pass / 2 fail), identical failing case
set (114 gtest fails + 2 GPU memory access faults, all in pre-existing
fp16/bf16 group-mode `Alibi`/`Dropout` cases that reproduce on develop
without this patch). Total wall time 403s → 393s. Per-case latency drift
±8% (noise).
- [x] CI to verify on other gfx9 / gfx11 architectures.
2026-05-08 08:51:33 +00:00
Illia Silin
cc29502c28 [rocm-libraries] ROCm/rocm-libraries#6933 (commit ac8b7d9)
[CK] Filter out unsupported targets.

## Motivation

Filter out any unsupported targets, e.g., gfx900, gfx906, gfx90c, from
the GPU_TARGETS or GPU_ARCHS lists.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-08 03:24:33 +00:00
Illia Silin
837fb379d9 [rocm-libraries] ROCm/rocm-libraries#7138 (commit 70e6660)
[CK] disable tile_engine by default, limit gfx1030 CI builds
 to develop only. (#7138)

## Motivation

An attempt to reduce the build time and keep CI moving faster.
Disable tile_engine by default since even the cmake step may take up to
30 minutes.
Since we're down to a single gfx1030 CI node, use it only for develop
builds.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-08 01:41:00 +00:00
Chao
1d1be9e3de [rocm-libraries] ROCm/rocm-libraries#6529 (commit 93a6097)
[CK_TILE] Enable V3 persistent kernel dispatch for FMHA
 forward on gfx950 (#6529)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

[CK_TILE] Enable V3 persistent kernel dispatch for FMHA forward on
gfx950

## Motivation

Enable the existing V3 persistent kernel path for CK-Tile FMHA forward
on
gfx950 (MI350X/MI355X). The V3 kernel and codegen infrastructure already
exist but are disabled via hardcoded `F_is_v3_enabled=False`.

This change replaces the compile-time gate with a runtime environment
variable
`CK_FMHA_ENABLE_V3=1` (disabled by default, opt-in). When enabled:
- **Prefill** workloads (seqlen_q > 1) dispatch to V3 persistent
pipeline
- **Decode** workloads (seqlen_q == 1) always use V2 (memory-bound,
better suited)

The V3 persistent kernel uses grid-stride scheduling, XCD-interleave
tile
assignment for L2 locality, LPT reversal for causal masks, and gfx950
async
buffer loads.

## Technical Details

Single file: `example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py`
- Add `#include <cstdlib>` and `<string>` for `std::getenv`
- Replace `{F_is_v3_enabled}` template parameter with runtime env var
check
- Add `seqlen_q > 1` guard (decode always uses V2)
- Remove `.format()` call in `write_fwd_api()`

## Dependencies

Depends on https://github.com/ROCm/rocm-libraries/pull/6501 — builds on
XCD-interleave and LPT scheduling infrastructure.

## Test Plan

- GPU validation on MI300X (gfx942, ROCm 6.4.1):
- Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128
-prec=bf16 -v=1 -warmup=1 -repeat=3`
- GPU validation on MI350X (gfx950, ROCm 7.0):
- Command (V2): `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096
-d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3`
- Command (V3): `CK_FMHA_ENABLE_V3=1 ./build/bin/tile_example_fmha_fwd
-b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3`
- Command (decode, always V2): `./build/bin/tile_example_fmha_fwd -b=64
-h=32 -h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1
-repeat=3`

## Test Result

Benchmark results (MI350X, gfx950, ROCm 7.0):

| Config | V2 (TFlops) | V3 (TFlops) | Speedup |
|--------|-------------|-------------|---------|
| Non-causal b=2 h=8 hk=2 s=4096 d=128 bf16 | 696.3 | 884.2 | **+27.0%**
|
| Causal b=2 h=8 hk=2 s=4096 d=128 bf16 | 371.3 | 494.9 | **+33.3%** |
| GQA b=2 h=32 hk=8 s=2048 d=128 bf16 | 671.3 | 831.7 | **+23.9%** |
| LLaMA-70B b=1 h=64 hk=8 s=4096 d=128 bf16 | 761.5 | 927.3 | **+21.8%**
|
| Causal GQA b=2 h=32 hk=8 s=2048 d=128 bf16 | 345.4 | 631.9 |
**+82.9%** |
| Long-seq b=1 h=16 s=16384 d=128 bf16 | 797.8 | 969.9 | **+21.6%** |
| Decode b=64 h=32 hk=8 s=1 s_k=4096 bf16 | 1828 GB/s | — (V2 path) |
unaffected |

Benchmark results (MI300X, gfx942, ROCm 6.4.1):

V3 has 0% effect on MI300X — V3 relies on gfx950 async buffer loads and
falls back to the V2 code path on gfx942. No regression on any config.

| Config | TFlops / GB/s | Time (ms) | Delta vs baseline |
|--------|-------------|-----------|-------------------|
| MHA bf16 b=2 h=8 s=4096 d=128 | 342.98 TFlops | 0.401 | +0.1% |
| MHA fp16 b=2 h=8 s=4096 d=128 | 411.18 TFlops | 0.334 | +4.9% |
| Causal MHA bf16 b=2 h=8 s=4096 d=128 | 232.61 TFlops | 0.296 | +2.4% |
| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 | 320.07 TFlops | 0.429 |
-1.4% |
| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 | 353.91 TFlops | 0.777 |
+1.7% |
| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 | 381.53 TFlops |
1.441 | +1.2% |
| Long-seq bf16 b=1 h=16 s=16384 d=128 | 388.61 TFlops | 5.659 | +1.4% |
| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 | 693.40 GB/s | 1.550 |
+0.3% |

All validation tests pass (`valid:y`) on both MI300X and MI350X.

Additional validation:
- `CK_FMHA_ENABLE_V3=0` correctly falls back to V2 (default behavior
unchanged)
- `CK_FMHA_ENABLE_V3=1` dispatches to V3 for prefill, V2 for decode
- Validation passes across fp16/bf16, batch/group mode,
causal/non-causal
- No regression on decode path
2026-05-07 16:23:19 +00:00
Linjun-AMD
33b62ed087 [rocm-libraries] ROCm/rocm-libraries#6914 (commit b791478)
[CK_TILE][FMHA] Fix sink un-mask under right-window and emit
 fp8bf16 batch_prefill sink kernels (#6914)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Two related fixes to `ck_tile` FMHA so that StreamLLM-sink +
sliding-window
  batch-prefill works correctly for fp8 KV / bf16 compute.

  Review the commits in this order:

  1. `fmha: emit sink kernels for fp8bf16 batch_prefill`
Extends `example/ck_tile/01_fmha/codegen/ops/fmha_batch_prefill.py` so
     the fp8(KV) / bf16(QO) batch-prefill codegen also emits the
`mask=mask_enum::generic_with_sink` variant. Without this the runtime
     could not dispatch to a sink-aware kernel for the fp8bf16 path.

  2. `fmha: respect right-window in IsOutOfSinkBound`
The sink un-mask in `GenericAttentionMask::IsOutOfSinkBound` (local-mask
branch) used `(i_y + x) > 1` as the gate, which conditioned on the row
     index instead of the column index. As a result, queries `1..sink-1`
could attend to *future* sink positions (violating causal /
right-window),
while query `0` fell back to the plain causal mask. The fix replaces the
guard with `i_x < i_y + x` so every query only sees sink columns up to
     its own right-window boundary.

  3. `fmha: clarify IsOutOfSinkBound predicate comment`
Doc-only follow-up that rewrites the comment above the predicate as a
     clause-by-clause explanation (`i_x < sink`, `i_x < i_y + x`,
     `y < y_total`, `i_y < x_total`).

  ## Test plan

- [x] Repro on aiter `op_tests/test_batch_prefill.py` (fp8 +
bf16_dequant
        modes with `sink=4`, `win_left=1023`, `softcap=0.0`, `sal=True`)
        now passes for all parametrized shapes.
- [x] Existing fp16/bf16 batch-prefill paths (no sink) unchanged —
codegen
diff only adds the `generic_with_sink` variant for fp8bf16; existing
        kernel object lists unaffected.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-07 02:40:45 +00:00
Yi DING
207a95d5e4 [rocm-libraries] ROCm/rocm-libraries#6152 (commit 36b016a)
[CK_TILE] Use Unified Workspace for FMHA BWD
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation
`dq_acc` is the intermediate accumulation buffer used in FMHA backward
pass for deterministic mode. The current implementation allocates it as
a **single rectangular tensor**:

```
shape = [shape_batch, nhead, nsplits, shape_seqlen_q, hdim_q]
```

where `nsplits = launcher.dq_acc_splits` (a single scalar), computed
from `max_seqlen_k` and shared across all batches.

### Problems

1. **Memory waste**: In group mode, each batch may have a different
`seqlen_k`, but `nsplits` is computed from `max_seqlen_k`, causing
batches with shorter `seqlen_k` to over-allocate in the split dimension.

2. **Interface coupling**: `fmha_bwd_args` exposes internal layout
details such as `stride_dq_acc`, `nhead_stride_dq_acc`,
`batch_stride_dq_acc`, and `split_stride_dq_acc`. The caller is
responsible for computing these strides, but this logic belongs inside
the kernel.

### Goals

1. Switch `dq_acc` buffer to a **compact layout**: batches are
concatenated contiguously, with each batch occupying `nhead * nsplits_i
* seqq_i * hdim_q` elements (nhead outermost).
2. **Remove all `*_stride_dq_acc` fields** from `fmha_bwd_args`,
replacing them with a single `workspace_ptr`; the kernel splits this
internally using a fixed layout.
4. `fmha_bwd_launcher` provides a **workspace management interface**:
the caller only needs to allocate GPU memory and call
`prepare_workspace()` — no layout computation required.
5. **Isolate kernel internals from the caller API**: the `dq_acc` layout
(nsplits, strides, buffer size) is determined entirely inside the
launcher/kernel. Future changes to block shape, pipeline type, or
persistent kernel strategy require no modifications to the caller's
`fmha_bwd_args` or workspace allocation logic.

## Technical Details

### Interface Design

#### New fields in `fmha_bwd_traits`

```cpp
struct fmha_bwd_traits
{
    int seqlen_q;
    int seqlen_k;
    int batch;
    int max_seqlen_q;
    int max_seqlen_k;
    int hdim_q;
    int hdim_v;
    int nhead_q;
    int nhead_k;
    std::string data_type;
    bool is_group_mode;
    mask_enum mask_type;
    bias_enum bias_type;
    bool has_dbias;
    bool has_dropout;
    bool is_store_randval;
    bool is_deterministic;
    // New: cumulative physical seqlen pointers for group mode (pass nullptr for batch mode).
    // seqstart_qs[i+1] - seqstart_qs[i] = physical seqlen_q of batch i (including padding); length = batch+1
    // seqstart_ks[i+1] - seqstart_ks[i] = physical seqlen_k of batch i (including padding); length = batch+1
    const int* seqstart_qs = nullptr;
    const int* seqstart_ks = nullptr;
};
```

#### `fmha_bwd_launcher` actual structure

```cpp
struct fmha_bwd_launcher
{
    std::function<float(fmha_bwd_args, const ck_tile::stream_config&)> run{};

    // Total workspace size in bytes (host_ws_size + device_ws_size), computed by init().
    // Zero for kUseQrQtrDorPipeline (writes dq directly, no acc buffer needed).
    size_t workspace_size = 0;

    fmha_bwd_launcher(const fmha_bwd_traits&);

    // Copies auxiliary data (nsplits[], offsets[]) via hipMemcpy to the head of the GPU workspace,
    // and zeros the dq_acc buffer portion (tail of workspace) if required.
    // The memory pointed to by device_ws must be >= workspace_size bytes.
    std::function<void(void* device_ws)> prepare_workspace{};

    template <typename... Args>
    float operator()(Args&&... args) const { return run(std::forward<Args>(args)...); }

private:
    size_t host_ws_size   = 0;  // CPU workspace size (nsplits[] + offsets[] arrays)
    size_t device_ws_size = 0;  // GPU-only data size (dq_acc buffer)
    std::unique_ptr<char[]> ws_host;  // host-side workspace buffer

public:
    template <typename T0, typename T1, typename T2, typename Arch>
    void init(const fmha_bwd_traits& traits);
};
```

The `init<>()` template method (invoked by codegen dispatch branches as
`this->init<...>(t)`) is responsible for:
1. Setting the `run` lambda
2. Calling `FmhaBwdDQDKDVKernel::GetWorkspaceHostSize(batch)` to obtain
`host_ws_size`
3. Allocating `ws_host` (host memory)
4. Calling `FmhaBwdDQDKDVKernel::PrepareWorkspaceHost(ws_host.get(),
...)` to fill nsplits/offsets; return value is `device_ws_size`
5. `workspace_size = host_ws_size + device_ws_size`
6. Setting the `prepare_workspace` lambda (captures `this`, calls
`PrepareWorkspaceDevice`)

When no kernel matches the given traits, both `run` and
`prepare_workspace` are initialized to default lambdas that print a
warning to `std::cerr` and return gracefully (no exception).

#### Workspace overall layout

The workspace is managed by `FmhaBwdWorkspaceManager` and consists of
two segments:

```
Offset 0 (CPU-prepared segment, host_ws_size bytes; also hipMemcpy'd to the head of GPU workspace):
  index_t nsplits[batch or 1]       — per-batch nsplits array
                                      group mode: batch elements
                                      batch mode / non-deterministic: 1 element
  [group mode only] long_index_t dq_acc_offsets[batch+1]
                                    — per-batch element offset (inclusive prefix sum)
                                      offsets[0]=0, offsets[i+1] = offsets[i] + nhead*nsplits_i*seqq_i*hdim_q

Offset host_ws_size (device data segment, device_ws_size bytes):
  AccDataType dq_acc[total_elements] — compact dq_acc buffer (zeroed if required)
                                       total_elements = sum_i(nhead * nsplits_i * seqq_i * hdim_q)
                                       layout within each batch: [nhead, nsplits_i, seqq_i, hdim_q]
                                       note: seqq_i uses the physical length (including padding)
```

Alignment constant (`ALIGNMENT = 16`):
```
nsplits_size  = align_up(sizeof(index_t) * N, 16)          // N = batch (group) or 1 (batch/non-det)
offsets_size  = align_up(sizeof(long_index_t) * (batch+1), 16)  // group mode only
host_ws_size  = nsplits_size + offsets_size
dq_acc_offset = host_ws_size  // GetDqAccDataOffset(batch)
```

**Key benefits**:
- The kernel reads nsplits/offsets directly from the workspace head — no
device-side recomputation.
- `FmhaBwdConvertQGradKernel` is completely decoupled from the pipeline
block shape (`kN0`): nsplits is read from `nsplits_ptr`, `kN0` is no
longer a template parameter, and multiple dq_dk_dv tiles with different
`F_bn0` values now share a single convert_dq kernel instance (under
receipt 1/2, deterministic convert_dq kernel count drops from ~300 to
60).
- nsplits/offsets are computed on the host and transferred in one
`hipMemcpy`; the dq_acc buffer follows immediately, at the offset given
by `GetDqAccDataOffset`.

#### Workspace size by scenario

| Scenario | `workspace_size` | Notes |
|----------|-----------------|-------|
| **kUseQrQtrDorPipeline** (any mode) | `0` | Writes dq directly; no acc
buffer; `PrepareWorkspaceHost` returns 0 |
| **Non-deterministic + batch mode** | `> 0` | nsplits[1]=1; dq_acc used
for atomic add; `workspace_size = host_ws_size +
batch*nhead*seqlen_q*hdim_q*ebytes` |
| **Non-deterministic + group mode** | `> 0` | nsplits[1]=1; dq_acc
contiguous layout; `workspace_size = host_ws_size +
nhead*seqstart_qs[batch]*hdim_q*ebytes` |
| **Deterministic + group mode** | `> 0` | nsplits[batch],
offsets[batch+1], compact dq_acc; nsplits_i computed independently per
batch |
| **Deterministic + batch mode persistent** | `> 0` | nsplits[1]
(uniform across batches); dq_acc `batch*nhead*nsplits*seqlen_q*hdim_q` |

**NeedsZeroDqAcc** (determines whether `PrepareWorkspaceDevice` calls
`hipMemset`):
- Persistent kernel (deterministic batch mode) or non-deterministic:
**must zero** (atomic add requires zero initialization)
- Deterministic group mode + no mask: **no zeroing needed** (every tile
writes its full region)
- Deterministic + with mask: **must zero** (some blocks are skipped,
leaving uninitialized tiles that would contribute to the reduction)

#### Caller usage

```cpp
// 1. Create launcher (traits include seqstart_qs/ks pointers; workspace_size is computed during construction)
fmha_bwd_launcher launcher(fmha_traits);

// 2. Read launcher.workspace_size directly
const auto ws_size = launcher.workspace_size;

// 3. Allocate a single GPU workspace
ck_tile::DeviceMem ws_buf(ws_size);

// 4. Copy nsplits/offsets to GPU head and zero dq_acc if required
launcher.prepare_workspace(ws_buf.GetDeviceBuffer());

// 5. Build args with a single workspace pointer; the kernel splits it internally
fmha_bwd_args args{
    ...,
    ws_size > 0 ? ws_buf.GetDeviceBuffer() : nullptr,  // workspace_ptr
};
launcher(args, stream_config);
```
2026-05-07 02:23:28 +00:00
Illia Silin
250c29f914 [rocm-libraries] ROCm/rocm-libraries#7046 (commit aaf7665)
[CK] fix CI git token.

## Motivation

Fix the CI breakage due to git PAT deprecation.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-06 02:32:37 +00:00
Gino Lu
b00e5449c8 sparse_attn: split KStats kernel, add README + perf charts
- Split SpargeKStatsKernel/Pipeline out of BlockMap (Kernel A produces
  per-block K stats workspace consumed by Kernel B), removing redundant
  K-stat recomputation across Q-blocks.
- Add example/ck_tile/50_sparse_attn/README.md (status vs upstream pinned
  to ae5b629, unported items, usage, references).
- Add example/ck_tile/50_sparse_attn/docs/{speedup_vs_sparsity,kernel_breakdown}.png
  + reusable plot_sparge_perf.py (b=2 h=32 s=16384 d=128 fp16 perf snapshot).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-05-05 03:13:24 -04:00
Jeff Huang
10cb6916c3 [rocm-libraries] ROCm/rocm-libraries#6932 (commit ce3e67b)
[CK] Fix OOB page table read in batch_prefill V prefetch
 (AICK-1171) (#6932)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Fix a GPU memory access fault in `mha_batch_prefill` triggered when the
per-batch page table is tightly sized (no trailing slack).

**Affected configurations:**
- All FMHA batch prefill V2 kernels
(`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`)
- Triggered by paged KV layouts where `kv_page_indices.numel() ==
ceil(seqlen_k / page_size)` exactly
- Manifests as: `Memory access fault by GPU node-X (Agent handle:
0x...)` followed by `Aborted (core dumped)`
- Silent corruption (no fault, wrong output) when the OOB read happens
to land in zero-initialized memory

### Root cause

`load_physical_pages` performs **lookahead reads** on the page table to
prefetch K/V tiles for the next iteration. When the page table for a
batch has exactly `N` entries, the V-tile prefetch indexes `page_idx[N]`
(one past the last valid entry), reading either uninitialized memory or
the next batch's slot. On gfx942 with a tightly-sized page table, the
read crosses into an unmapped page and triggers an HSA page fault.

The bug was masked in earlier testing because most test harnesses pad
`kv_page_indices` with trailing zeros — OOB reads then return `page_id =
0`, a valid in-cache page, producing silent numerical drift instead of a
fault.

### Fix design

Thread `max_page_table_idx = (seqlen_k - 1) / page_size` from the kernel
layer down to `load_physical_pages`, and clamp every page-table read
with `ck_tile::min()`. Applied to **all four code paths** in the V
prefetch:

| Branch | What it does | Clamp applied |
|--------|-------------|---------------|
| `kIsKcache` | K prefetch loop | `min(global_token_idx >>
kLog2PageSize, max_page_table_idx)` |
| V LINEAR (`page_size == 1`) | One token = one page |
`min(global_token_idx, max_page_table_idx)` |
| V crosses pages (`kVTileCrossesPages`) | Per-thread page lookup |
`min(global_token_idx >> kLog2PageSize, max_page_table_idx)` |
| V single page (lane0 broadcast) | `readfirstlane`-uniform lookup |
`min(... >> kLog2PageSize, max_page_table_idx)` |

### Key design decisions

**Mandatory parameter, not optional with a sentinel default.** An
optional `max_page_table_idx = INT32_MAX` default would let the bug
silently come back at any new callsite that forgets to pass it. Making
it mandatory forces every caller to opt in explicitly and surfaces
missed callsites at compile time.

**`seqlen_k == 0` clamps to 0** instead of underflowing `(0 - 1) /
page_size` to `-1`. The empty-batch case is rare but well-defined: clamp
every read to slot 0.

**Single computation in the kernel layer.**
`FmhaBatchPrefillWithPagedKVCacheKernel` computes `max_page_table_idx`
once per batch and forwards it through every QScale branch (PERTENSOR /
KV_BLOCKSCALE / default). All three `operator()` overloads of the
pipeline (rich, default forwarder, KV_BLOCKSCALE forwarder) take and
forward the parameter.

### Files changed

| File | Change |
|------|--------|
| `include/ck_tile/ops/fmha/kernel/fmha_batch_prefill_kernel.hpp` |
Compute `max_page_table_idx` per batch, forward to all 3 QScale branches
|
|
`include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`
| Add `max_page_table_idx` to `load_physical_pages` and 3 `operator()`
overloads; clamp page-id reads in 4 code paths |

## Test plan

- [x] AICK-1171 reproducer verified on MI-308X (gfx942)
- [x] New pytest case `test_batch_prefill_aick1171_oob_page_table_read`
in aiter, parametrized over `total_blocks ∈ {160, 164, 168, 176, 208,
256}` (matches the `crash1_r8_*` bisect family)
- [x] Full FMHA batch prefill suite on gfx942 + gfx950

## Linked issue

AICK-1171.
2026-05-05 06:29:51 +00:00
Yaswanth Raparti
af02240be8 [rocm-libraries] ROCm/rocm-libraries#6912 (commit c705da2)
[CK] Reduce per-file logging in cmake_dependency_analyzer
 (#6912)

## Motivation

Current progress_callback function generates large volume of prints
which creates noise in seeing actual CI failure logs.
Only emit a progress line at the completion of each stage to avoid
massive logs from the per-source-file extracting_dependencies callback.

## Technical Details

Update the `progress` function to print only at the completion of each
stage.
https://github.com/ROCm/rocm-libraries/pull/6912/changes#diff-15971b83c7dfefb48fd788507a923017d93bbd9487ed6aeb414ad2c5e00be934R720

## Test Plan

to be tested in CI

## Test Result

to be tested in CI

## Submission Checklist

- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-01 08:13:45 +00:00
John Shumway
f3d8a1269a [rocm-libraries] ROCm/rocm-libraries#6972 (commit 8761b90)
[CK] Dockerfile: auto-discover latest TheRock nightly tarball
 (#6972)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Our docker containers with `--build-arg compiler_version=therock` should
have the latest nightly build of TheRock in `/opt/rocm`. When I looked
for `rocm_kpack` and other `kpack` artifacts, they were missing, and I
realized we had pinned the version by date. Instead, we should look for
the most recent linux-multiarch tarball.

## Summary
- Auto-discover the latest TheRock nightly tarball at Docker build time
instead of pinning a stale URL (previously hardcoded to a Feb 2026
nightly that predates kpack)
- Logic is to `wget` the directory, and identify the latest tarball
(alphabetically sorted by YYYYMMDD in filename).
- Support manual override via `--build-arg TARBALL_URL=...` for pinning,
and `--build-arg TARBALL_PATTERN=...` for selecting a specific arch
variant
- Fix sccache download URL: `/releases/latest/download/` was redirecting
to v0.15.0 but the filename referenced v0.14.0, causing a 404

## Test plan
- [x] Verified tarball discovery logic resolves to
`therock-dist-linux-multiarch-7.13.0a20260430.tar.gz`
- [x] Built Docker image locally with `--build-arg
compiler_version=therock`
- [x] Confirmed sccache installs successfully with the fixed URL
- [ ] Verify CI pipeline builds with the updated Dockerfile

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 06:42:14 +00:00
ltqin
de0a61e5c2 [rocm-libraries] ROCm/rocm-libraries#6574 (commit b3db057)
[CK_TILE] Add SageAttention v2 forward kernel with
 multi-granularity quantization (#6574)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Add a CK_TILE forward kernel implementing [SageAttention
v2](https://arxiv.org/abs/2411.10958) — an attention algorithm that
applies multi-granularity quantization to Q/K/V before computing
attention, trading minimal accuracy loss for higher throughput on
low-precision hardware.

### Quantization design

| Tensor | Supported data types | Scale granularity options |
|--------|---------------------|--------------------------|
| Q | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp
(32 tokens), per-thread (4 tokens) |
| K | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp
(64 tokens), per-thread (16 tokens) |
| V | fp8 | per-channel (always) |
| O | bf16 | — |

Three precision combinations are supported: `fp8/bf16` (QKV fp8, O
bf16), `i8/fp8/bf16` (QK int8, V fp8, O bf16), and `i4/fp8/bf16` (QK
int4, V fp8, O bf16).

### Architecture support

- **gfx9** (CDNA2/3, e.g. gfx90a, gfx942) — full tile set
- **gfx950** (CDNA4) — restricted tile set (N-per-block capped at 64 for
fp8-family dtypes)

### Implementation

- Two pipeline variants: `QRKSVS` (synchronous) and `QRKSVS_ASYNC`
(async copy)
- Masking support: no mask, causal (top-left / bottom-right), and
generic windowed
- Batch and group (variable-length) modes
- Head dimension: d=128, d_v=128
- Python codegen under `example/ck_tile/49_sageattention/codegen/`
generates kernel instances per target/dtype/tile combination
- Smoke tests included via `tile_example_sageattn_fwd`

### Test commands

\`\`\`bash
# fp8 QKV
./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128
-kname=1 -prec=fp8bf16 -qscale=3 -init=3

# int8 QK, fp8 V
./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128
-kname=1 -prec=i8fp8bf16 -qscale=3 -init=3
\`\`\`

\`-qscale\` values: 1=per-tensor, 2=per-block, 3=per-warp, 4=per-thread
2026-04-30 18:33:36 +00:00
Illia Silin
e8d64ad5c6 [rocm-libraries] ROCm/rocm-libraries#6741 (commit 0d4180f)
[CK] restore fmha performance reporting and disable c++17 in
 CI. (#6741)

## Motivation

This change restores monitoring of FMHA benchmarks performance in daily
builds and removes the std=c++17 flag from CI builds on gfx90a.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-25 02:23:10 +00:00
Qianfeng
865ab2b8ed [rocm-libraries] ROCm/rocm-libraries#6209 (commit 89c9f3e)
Improve the performance of qr_ks_vs_whole_k_prefetch pipeline
 (#6209)

## About qr_ks_vs_whole_k_prefetch pipeline
This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to
improve performance on both MI350 GPUs through better MFMA instruction
usage, transposed V-loading support, and N0-loop implementation. The
pipeline targets scenarios where the number of workgroups is low,
enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs
128) while prefetching entire K tiles.

## Changes:

- Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload)
to avoid using shuffle instructions on MI350
- Implements N0-loop based Gemm0 to reduce tile window movement overhead
and eliminate `clear_tile` calls
- Adds full support for hdim96/hdim160 without padding requirements
- Updates MFMA instruction selection to ensure optimal choices for MI350

## Performance results

1. For attention shapes which leads to kM0=64,
`qr_ks_vs_async_whole_k_prefetch_trload` shows much better performance
than `qr_ks_vs_async_trload` on the same case (execution time `41.02ms`
by whole_k_prefetch_trload & `58.50ms` by async_load), and
`qr_ks_vs_async_whole_k_prefetch_trload` also shows obviously better
performance than the recently tuned `qr_ks_vs_async` on the same case
(execution time `41.02ms` by whole_k_prefetch_trload 7 `47.60ms` by
qr_ks_vs_async)
2. Also on MI300, for attention shapes which leads to kM0=64,
`qr_ks_vs_async_whole_k_prefetch` shows much better performance than the
`qr_ks_vs_async` (which is supposed to be very high-efficient) on the
same case (execution time `64.50ms` by whole_k_prefetch & `80.20ms` by
qr_ks_vs_async)
3. For attention shapes which leads to kM0=128,
`qr_ks_vs_async_whole_k_prefetch_trload` show a little bit better
performance than `qr_ks_vs_async` on mi350 (execution time `104.50ms` by
whole_k_prefetch_trload & `106.50ms` by qr_ks_vs_async). And they shows
completely on-par performance on MI300

## Test/Verify

1. Use the ROCM xformers branch `test_whole_k_prefetch_n0loop` to
test/verify qr_ks_vs_whole_k_prefetch pipeline since this pipeline can
not be used by ck_tile fmha example so far
2.  Use the following command-line for building/testing xformers
>```bash
> #> git clone -b test_whole_k_prefetch_n0loop
https://github.com/ROCm/xformers
> #> git submodule update --init --recursive
> #> pip  install --no-build-isolation -e ./
> #> pytest tests/test_mem_eff_attention.py::test_forward
>```
4. Any scripts which can run on xformers can be used to evaluate
qr_ks_vs_whole_k_prefetch pipeline. Using the two environ variable to
switch from using different pipelines
> ```bash
> #> export FMHA_DISABLE_SPECIAL_TREATMENT=1 #> to disable using FAV3
and qr_ks_vs_async_trload pipeline
> #> export FMHA_ENABLE_ASYNC_PIPELINE=1 #> to disable using
qr_ks_vs_async pipeline for comparing
> ```

## Discussion
2026-04-24 16:31:59 +00:00
Gino Lu
eca3cb3e0a sparse_attn: add bm0 dispatch for sparge blockmap compatibility
Add bm0 field to fmha_jenga_fwd_traits so callers can specify the
preferred Q-tile size. Codegen now emits separate tile configs for
bm0=64 (sparge blockmap) and bm0=128 (original), with CppConstraint
guards to select the right kernel at runtime.

End-to-end test passes for both jenga and vsa paths. Performance is
known to be suboptimal at this stage; tile sizes and warp counts for
the bm0=64 path have not been tuned.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 05:13:51 -04:00
Yi DING
b2ea5fd315 [rocm-libraries] ROCm/rocm-libraries#6701 (commit f9a8d1c)
[CK] Fix CI Failures for PR From Forks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

Fork PRs fail CI when `RUN_AITER_TESTS` or `RUN_FA_TESTS` is enabled.
The docker scripts run `git clone -b "$CK_*_BRANCH"
https://github.com/ROCm/rocm-libraries.git`, but a fork's branch doesn't
exist upstream:

```
fatal: Remote branch <fork-branch> not found in upstream origin
```

Example: [PR #6529 build
#4](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-6529/4/pipeline).

## Technical Details

**`Jenkinsfile`** — for PRs, use the upstream-visible PR ref instead of
the head branch name:

```groovy
CURRENT_BRANCH_NAME = env.CHANGE_ID
    ? "refs/pull/${env.CHANGE_ID}/head"
    : (env.CHANGE_BRANCH ? env.CHANGE_BRANCH : env.BRANCH_NAME)
```

**`Dockerfile.aiter` / `Dockerfile.fa`** — `git clone -b <ref>` only
accepts branches (`refs/heads/*`) and tags (`refs/tags/*`), so it can't
resolve `refs/pull/N/head`. Switch to `git fetch`, which accepts any
refspec (and still works for plain branch names):

```sh
mkdir rocm-libraries && cd rocm-libraries
git init -q
git remote add origin https://github.com/ROCm/rocm-libraries.git
git fetch --depth 1 --filter=blob:none origin "$CK_*_BRANCH"
git sparse-checkout init --cone
git sparse-checkout set projects/composablekernel
git checkout FETCH_HEAD
```

`git checkout FETCH_HEAD` lands in detached HEAD, which breaks the
existing `git branch -m "$CK_*_BRANCH"` (and that name isn't a valid
local branch anyway). Decouple the local branch name from the upstream
ref:

- Replace `git init` + `git branch -m` with `git init -b
"$LOCAL_BRANCH"` (requires git ≥ 2.28, satisfied by base images)
- `LOCAL_BRANCH="ck-import-${ROCM_LIBRARIES_SHA}"` in the rocm-libraries
path; `LOCAL_BRANCH="$CK_*_BRANCH"` in the fallback
- Downstream `git clone -b ... ../ck` uses `$LOCAL_BRANCH`

## Test Plan

Manually trigger a build on this PR with `RUN_AITER_TESTS=true` and
`RUN_FA_TESTS=true`; both docker images should build end-to-end.

## Test Result
[jenkins / rocm-libraries-folder/Composable Kernel / PR-6701 /
#3](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-6701/3/pipeline/)

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-24 08:23:24 +00:00
Jeff Huang
fdf4bb7fcc [rocm-libraries] ROCm/rocm-libraries#6653 (commit 1df887e)
[CK_TILE] fix(fmha): support >2GB KV cache in batch prefill
 via template dispatch (#6653)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Motivation

The CK batch prefill kernel previously failed (silent overflow + page
faults) when the KV cache exceeded 2 GB, blocking long-context inference
workloads (e.g., 128K+ token contexts with paged KV).

Two distinct failure modes were addressed:

1. **>4GB SRD overflow (`page_size < kN0`):** The SRD
`buffer_load_dwordx4` path uses a 32-bit `voffset` register; for small
page sizes the rebased SRD spans the full KV pool and the offset wraps
past 2 GB, corrupting K/V loads.
2. **gfx950 page-table fault (`page_size >= kN0`):** On CDNA4 the
hardware validates the **full SRD `num_records` range** against
page-table permissions (CDNA3 only checks per-instruction `voffset`).
After per-tile SRD rebase, an un-trimmed `num_records` field extends
past the live page and faults on freed/protected memory.

## Technical Details

**Two-mode `tile_scatter_gather` selected by the `kUseGlobalLoad`
template parameter:**

| Case | `page_size` | KV cache size | Mode | Load path | Addressing |
|---|---|---|---|---|---|
| 1 | `>= kN0` (large pages) | any | SRD (`kUseGlobalLoad=false`) |
`buffer_load_dwordx4` | 32-bit `voffset`, bounded by per-page rebase |
| 2 | `< kN0` (small pages) | `<= 2 GB` | SRD (`kUseGlobalLoad=false`) |
`buffer_load_dwordx4` | 32-bit `voffset`, fits in INT32 byte range |
| 3 | `< kN0` (small pages) | `> 2 GB` | Global-load
(`kUseGlobalLoad=true`) | `async_load_tile_raw_flat` (K) +
`load_tile_flat` (V) | 64-bit |

**Dispatch:** the auto-gen API layer (`fmha_batch_prefill.py`) selects
the kernel instantiation at launch from `(page_block_size,
num_total_pages * batch_stride_k * kElementBytes)`, so the small-page
penalty is paid only when correctness requires it.

**gfx950 SRD `num_records` trimming:** in the K and V rebase lambdas of
`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`,
`set_bottom_tensor_view_buffer_size(page_stride_k/v)` is called after
each rebase to constrain `num_records` to the live page. Required for
CDNA4 page-table validation; harmless on CDNA3.

**Pipeline sync for the global-load path:**
- V uses synchronous `load_tile_flat`; K uses
`async_load_tile_raw_flat`.
- `v_physical_pages_current` is double-buffered so the V flat load
doesn't race against the next iteration's K rebase computation.

**Arch guards:** `global_load_lds` intrinsics are gated to `__gfx94__` /
`__gfx950__` (CDNA3+). Other architectures hit a `dependent_false`
static_assert with a descriptive message.

**Device-side assertion convention:** SRD setters use
`__builtin_assume(cond)` (hint-only) rather than `<cassert>`'s
`assert()`. The latter introduces an `__assert_fail` call whose register
pressure scatters the K-SRD scalar register window across conditional
branches, corrupting `buffer_load_dwordx4` on gfx950.

## Test Plan

Tested on both MI308 (gfx942) and MI355 (gfx950) via the aiter wrapper
test suite. All coverage lives in **`op_tests/test_batch_prefill.py`**:

- **Functional matrix (96 cases)** — `test_batch_prefill`: `page_size ∈
{1, 16, 1024}` × `kv_layout ∈ {linear, vectorized}` × `dtype ∈ {bf16,
fp8 quant variants}` × `causal` × `soft_cap` × `LSE` × `batch_size ∈ {1,
4}` (parametrized to exercise per-sequence SRD rebase across batch
boundaries).
- **>2 GB coverage** — `test_batch_prefill_large_kvcache`: extended to
allocate a 5 GB+ KV cache pool and exercise both `kUseGlobalLoad=true`
(small-page) and `kUseGlobalLoad=false` (large-page rebase) paths.
Includes both single-batch and multi-batch (`batch_size=4`) cases to
exercise per-sequence SRD rebase across the >2 GB pool.
- Numerical reference: PyTorch SDPA, per-batch loop with `atol` / `rtol`
from the existing batch prefill test harness.

## Test Result

| Arch | `test_batch_prefill` | `test_batch_prefill_large_kvcache` (>2
GB) |
|------|----------------------|---------------------|
| MI308 (gfx942) | All passed | Passed |
| MI355 (gfx950) | All passed | Passed |

**Performance impact (gfx950, hot SRD path):**
- +2.67% kernel-time on `seqlen=1024 / page_sz=1024 / bf16 / sglang /
causal / soft_cap=30`, attributable in full to the two
`set_bottom_tensor_view_buffer_size` calls in the K/V rebase lambdas
(5-run median, signal/noise ≈ 9×).
- This cost is **mandatory for gfx950 correctness** on >2 GB workloads —
removing the setters re-introduces page-faults.
- gfx942: 0 regressions in the same range (all configs ≤ +0.97%).

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-23 23:09:25 +00:00
Luo Cheng
2fae12cbbb [rocm-libraries] ROCm/rocm-libraries#6242 (commit f46ac14)
[CK] Fix out of bounds modifications caused by negative
 topk_ids in MoeSortingMultiPhaseKernel_P0_v1 (#6242)

## Motivation

Fix sglang randomly crash by filter negative topk ids.

## Technical Details

In sglang expert parallel mode, there may be idle batch (batch=0) fired,
it will reuse batch=1 resource in cuda graph mode. But in topk op, it
will set non used topk ids to -1, in idle batch case, all topk ids are
set to -1. In `MoeSortingMultiPhaseKernel_P0_v1` negative expert id will
cause overwrite somewhere and sglang may randomly crash.
Except idle batch case, if the captured batch sizes are discrete, there
may be -1 of expert id due to the similar logic.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: zovonoir <jialzhu@amd.com>
2026-04-23 22:45:32 +00:00
Artem Kuzmitckii
281d1bf50b [rocm-libraries] ROCm/rocm-libraries#6132 (commit e97065d)
[CK] Fix divide-by-zero crash for grouped conv kernels
 (#6132)

## Motivation

During run pytorch unit tests for conv3d:
`test_dtypes_nn_functional_conv3d_cuda`,
`test_fake_crossref_backward_amp_nn_functional_conv3d_cuda_float32`
found divide-by-zero crash during CK kernel selection.

Refs ROCM-20764

## Technical Details

Add assert for K0PerBlock equal 0, also covered other potential places
related with k_batch calculation.

## Test Plan
Run miopen command extracted from mentioned test:
`MIOpenDriver convfp16 --spatial_dim 3 -I NCDHW -O NCDHW -f NCDHW -n 1
-c 1 -k 1 -g 1 --in_d 4 -H 4 -W 4 --fil_d 4 -y 4 -x 4 --pad_d 0 -p 0 -q
0 --conv_stride_d 2 -u 2 -v 2 --dilation_d 1 -l 1 -j 1 -m conv -F 4 -t
1`
## Test Result
Passed

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>
2026-04-23 20:12:40 +00:00
KateJu
793a59736a [rocm-libraries] ROCm/rocm-libraries#6656 (commit 1c958f8)
Fix per-layer conv2d int8 CPU verification reference path
 (#6656)

case example_conv2d_fwd_xdl_perlayer_quantization_int8.exe 1 0

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-23 14:09:58 +00:00
KateJu
34c7c3bbf2 [rocm-libraries] ROCm/rocm-libraries#6655 (commit 677b38d)
Add missing lds sync

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-23 14:06:27 +00:00
jakpiase
ad412c26f3 [rocm-libraries] ROCm/rocm-libraries#6624 (commit 47d0162)
[CK_TILE] Grouped Convolution Backward Data Direct Load
 (#6624)

## Proposed changes

Add Grouped Convolution Backward Data with Direct Load into
DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffleV3 device implementation.
This enables direct global memory loading (bypassing LDS) for the
backward data convolution path on gfx950, following the same pattern
used in both backward weight and forward convolution.

Direct load convolution backward data improves performance by avoiding
LDS round-trips for certain configurations on gfx950, which supports a
wider range of instructions. Currently correctness is checked only at
usage point, but should be extended to a standalone UT in the future.
2026-04-23 09:17:50 +00:00