mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-11 17:00:18 +00:00
[CK] Fix OOB page table read in batch_prefill V prefetch (AICK-1171) (#6932)

## Summary

Fix a GPU memory access fault in `mha_batch_prefill` triggered when the per-batch page table is tightly sized (no trailing slack).

**Affected configurations:**

- All FMHA batch prefill V2 kernels (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`)
- Triggered by paged KV layouts where `kv_page_indices.numel() == ceil(seqlen_k / page_size)` exactly
- Manifests as `Memory access fault by GPU node-X (Agent handle: 0x...)` followed by `Aborted (core dumped)`
- Silent corruption (no fault, wrong output) when the OOB read happens to land in zero-initialized memory

### Root cause

`load_physical_pages` performs **lookahead reads** on the page table to prefetch K/V tiles for the next iteration. When the page table for a batch has exactly `N` entries, the V-tile prefetch indexes `page_idx[N]` (one past the last valid entry), reading either uninitialized memory or the next batch's slot. On gfx942 with a tightly sized page table, the read crosses into an unmapped page and triggers an HSA page fault.

The bug was masked in earlier testing because most test harnesses pad `kv_page_indices` with trailing zeros — OOB reads then return `page_id = 0`, a valid in-cache page, producing silent numerical drift instead of a fault.

### Fix design

Thread `max_page_table_idx = (seqlen_k - 1) / page_size` from the kernel layer down to `load_physical_pages`, and clamp every page-table read with `ck_tile::min()`.
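The clamp logic can be modeled on the host as a minimal sketch. The function names (`max_page_table_index`, `clamped_page_read`) and the use of `std::min` in place of `ck_tile::min` are illustrative assumptions, not the actual ck_tile code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Index of the last valid page-table entry for a batch.
// seqlen_k == 0 clamps to 0 instead of underflowing (0 - 1) / page_size.
int32_t max_page_table_index(int32_t seqlen_k, int32_t page_size)
{
    if(seqlen_k <= 0)
        return 0;
    return (seqlen_k - 1) / page_size;
}

// Every page-table read is clamped, so a lookahead one slot past the end
// re-reads the last valid entry instead of going out of bounds.
int32_t clamped_page_read(const std::vector<int32_t>& page_idx,
                          int32_t lookahead_idx,
                          int32_t max_idx)
{
    return page_idx[std::min(lookahead_idx, max_idx)];
}
```

For `seqlen_k = 48` and `page_size = 16`, a tightly sized table has exactly 3 entries and `max_page_table_index` returns 2, so a prefetch that would index slot 3 is clamped back to slot 2.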
Applied to **all four code paths** in the V prefetch:

| Branch | What it does | Clamp applied |
|--------|--------------|---------------|
| `kIsKcache` | K prefetch loop | `min(global_token_idx >> kLog2PageSize, max_page_table_idx)` |
| V LINEAR (`page_size == 1`) | One token = one page | `min(global_token_idx, max_page_table_idx)` |
| V crosses pages (`kVTileCrossesPages`) | Per-thread page lookup | `min(global_token_idx >> kLog2PageSize, max_page_table_idx)` |
| V single page (lane0 broadcast) | `readfirstlane`-uniform lookup | `min(... >> kLog2PageSize, max_page_table_idx)` |

### Key design decisions

**Mandatory parameter, not optional with a sentinel default.** An optional `max_page_table_idx = INT32_MAX` default would let the bug silently come back at any new callsite that forgets to pass it. Making it mandatory forces every caller to opt in explicitly and surfaces missed callsites at compile time.

**`seqlen_k == 0` clamps to 0** instead of underflowing `(0 - 1) / page_size` to `-1`. The empty-batch case is rare but well-defined: clamp every read to slot 0.

**Single computation in the kernel layer.** `FmhaBatchPrefillWithPagedKVCacheKernel` computes `max_page_table_idx` once per batch and forwards it through every QScale branch (PERTENSOR / KV_BLOCKSCALE / default). All three `operator()` overloads of the pipeline (rich, default forwarder, KV_BLOCKSCALE forwarder) take and forward the parameter.
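For the shifted branches in the table above, the clamp can be sketched on the host as follows. This assumes a power-of-two page size; `kLog2PageSize`, `prefetch_page_id`, and the `std::min` call are hypothetical stand-ins for the kernel-side code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed power-of-two page size: page_size == 16.
constexpr int32_t kLog2PageSize = 4;

// Model of one clamped page-table read in the V prefetch: the token index
// is shifted down to a page slot, then clamped to the last valid entry.
int32_t prefetch_page_id(const std::vector<int32_t>& page_idx,
                         int32_t global_token_idx,
                         int32_t max_page_table_idx)
{
    const int32_t slot =
        std::min(global_token_idx >> kLog2PageSize, max_page_table_idx);
    return page_idx[slot];
}
```

With a tightly sized 3-entry table (`seqlen_k = 48`), the lookahead token index 48 shifts to slot 3, one past the end; the clamp redirects it to slot 2, the last valid entry.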
### Files changed

| File | Change |
|------|--------|
| `include/ck_tile/ops/fmha/kernel/fmha_batch_prefill_kernel.hpp` | Compute `max_page_table_idx` per batch, forward to all 3 QScale branches |
| `include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp` | Add `max_page_table_idx` to `load_physical_pages` and 3 `operator()` overloads; clamp page-id reads in 4 code paths |

## Test plan

- [x] AICK-1171 reproducer verified on MI-308X (gfx942)
- [x] New pytest case `test_batch_prefill_aick1171_oob_page_table_read` in aiter, parametrized over `total_blocks ∈ {160, 164, 168, 176, 208, 256}` (matches the `crash1_r8_*` bisect family)
- [x] Full FMHA batch prefill suite on gfx942 + gfx950

## Linked issue

AICK-1171