e8587b86c2
Fix CK-UA pipeline: s_waitcnt_vmcnt<0> in fmha_post_process
The final V tile's async load was not properly waited on before reading
from LDS: s_waitcnt_vmcnt<K_inst> permits up to K_inst loads to remain
outstanding, so when K_inst == V_inst it is a no-op and all V_inst loads
may still be in flight. The last loop iteration never prefetches K, so
only V is outstanding. Use s_waitcnt_vmcnt<0> unconditionally.
This partially fixes the BS32 race condition for production workloads
(maxk >= 256). A deeper pipeline race remains for very short KV
sequences (maxk < ~165, 2-5 pages) with block_size=32 at high batch.
Made-with: Cursor
2026-04-01 23:04:07 +00:00
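
The wait-count reasoning in the commit message can be checked with a small counting model. This is only a sketch, not the CK kernel code: it assumes vmcnt(N) blocks until at most N vector-memory loads are outstanding, that loads retire in issue order, and that in a steady-state iteration the V loads are issued before the K prefetch. The function names and the instruction counts used below are hypothetical.

```python
def remaining_after_wait(outstanding: int, n: int) -> int:
    """Model s_waitcnt vmcnt(n): block until at most n loads are outstanding."""
    return min(outstanding, n)


def unfinished_v_loads(k_inst: int, v_inst: int, wait_n: int,
                       last_iteration: bool) -> int:
    """Return how many V loads may still be in flight when LDS is read.

    Assumption (hypothetical model): V loads are issued first, then the
    K prefetch for the next iteration, and loads retire in issue order.
    """
    # The last iteration issues no K prefetch, so only V is outstanding.
    k_issued = 0 if last_iteration else k_inst
    outstanding = v_inst + k_issued

    left = remaining_after_wait(outstanding, wait_n)

    # The newest loads are the ones still in flight; K is newest if issued.
    k_in_flight = min(left, k_issued)
    return left - k_in_flight


# Steady state: waiting for vmcnt<K_inst> leaves only the K prefetch pending.
print(unfinished_v_loads(k_inst=4, v_inst=4, wait_n=4, last_iteration=False))
# Last iteration with the old wait: no K was issued, so V itself is pending.
print(unfinished_v_loads(k_inst=4, v_inst=4, wait_n=4, last_iteration=True))
# Last iteration with vmcnt<0>: every V load has landed before the LDS read.
print(unfinished_v_loads(k_inst=4, v_inst=4, wait_n=0, last_iteration=True))
```

In this model the old wait is sufficient only while a K prefetch sits behind the V loads in the queue; once the prefetch is absent, nothing short of a zero count guarantees the V tile is complete.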