mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-07-02 13:17:36 +00:00
[CK Tile] Fix Stream-K flag store: wave-uniform SGPR address for scalar s_store/s_load (#8099) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Stream-K grouped-conv (and GEMM) kernels fail to assemble for some instances: the inline scalar flag store/load gets a VGPR address operand, which scalar-memory instructions reject (`invalid operand for instruction`). This blocks Stream-K instances from building. ## Technical Details - `StreamKReductionOps::{Signal,Wait}StorePartialDone` (shared by GEMM and conv, added in #5393) take `kargs` by `const&` and feed `kargs.workspace_ptr` / `cta_idx` into inline `s_store_dword`/`s_load_dword` with `"s"` constraints. For some instantiations the compiler can't keep the pointer wave-uniform and emits a VGPR address. - Fix: route the pointer and offset through `amd_wave_read_first_lane` so the scalar-memory address is a wave-uniform SGPR before the asm. Same instructions, no algorithm change. - Not arch-specific: the affected instance fails on gfx908/gfx90a/gfx942/gfx950 without the fix; whether the compiler spills to a VGPR depends on the instantiation (tile/warp/pipeline), not the target. ## Test Plan - Compile the previously-failing dispatcher instance for gfx908/gfx90a/gfx942/gfx950. - `test_ck_tile_grouped_conv_bwd_weight_streamk` on gfx942, gfx90a, gfx950 hardware. - gfx950 perf A/B (example, bf16/tree, 10 runs each) with vs without the change. ## Test Result - Failing instance now assembles on all four archs; previously failed on every one. - 30/30 conv Stream-K tests pass on gfx942, gfx90a, gfx950. - gfx950 perf delta -0.13% (within run-to-run noise) — no regression from the added readfirstlane on the cold flag path. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>