juuso-oskari
f5beedb2e9
Add CK-UA decode_d128_mha_m32 / _m16 small-Q tiers
For pure-decode workloads (sq=1) at d=128, the m128 tile wastes most of
its 128 query rows, leaving CK at or below Triton on most batch sizes in
our sweep (4..256). Add two small-Q tiers that mirror the d=64 GQA-8 ladder:
* decode_d128_mha_m16 : kBlockM=16, 1 warp, 16x16 MFMA (tiny-decode)
* decode_d128_mha_m32 : kBlockM=32, 1 warp, 32x32 MFMA (tiny-decode)
select_config now ladders by (avg_q, max_q): m16 -> m32 -> m128 -> prefill.
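
The ladder and the tile-waste argument above can be sketched as follows. This is a minimal illustration, not the actual select_config: the function names, the short tier labels, and every threshold (16, 32, 128) are assumptions chosen to match the kBlockM values named in this message.

```python
def select_tier(avg_q: float, max_q: int) -> str:
    """Hypothetical ladder: pick the smallest tile that still covers the queries.

    Thresholds are assumed for illustration; the real select_config may
    use different cut-offs or additional inputs.
    """
    if max_q <= 16:
        return "m16"      # kBlockM=16, 16x16 MFMA
    if max_q <= 32:
        return "m32"      # kBlockM=32, 32x32 MFMA
    if avg_q <= 128:
        return "m128"     # existing kBlockM=128 tile
    return "prefill"


def tile_utilization(sq: int, block_m: int) -> float:
    """Fraction of allocated query rows that hold real queries."""
    tiles = -(-sq // block_m)  # ceiling division: tiles needed to cover sq rows
    return sq / (tiles * block_m)


if __name__ == "__main__":
    # With MHA and sq=1, each tile carries a single useful query row.
    for block_m in (16, 32, 128):
        print(block_m, tile_utilization(1, block_m))
    print(select_tier(avg_q=1.0, max_q=1))
```

With sq=1 the m128 tile fills 1 of 128 rows, while m16 fills 1 of 16, which is the waste the small-Q tiers are meant to cut.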
d=128 MHA, hq=16/hk=16, sq=1, sk=120k, num_blocks=60k:

  batch   before   after   CK BW
      4   ~0.95x   0.98x   4.76 TB/s
      8   ~0.85x   1.29x   5.00 TB/s
     32   ~0.85x   1.14x   5.29 TB/s
     64   ~0.75x   0.93x   5.35 TB/s
    128   ~1.00x   1.09x   5.39 TB/s
    256   ~1.03x   1.02x   5.41 TB/s

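
For readers who want to relate a measured kernel time to a bandwidth figure like the CK BW column, here is one common way to estimate it for memory-bound decode. It is a sketch under stated assumptions, not the benchmark's actual formula: it counts only K and V reads, assumes 2-byte elements, and takes sk=120k to mean 120,000 tokens. The helper names are hypothetical.

```python
def kv_bytes_per_step(batch: int, sk: int, hk: int, d: int,
                      bytes_per_elem: int = 2) -> int:
    """Bytes of K and V read in one decode step (Q and output traffic ignored).

    bytes_per_elem=2 assumes fp16/bf16 KV storage, which this message
    does not state.
    """
    return 2 * batch * sk * hk * d * bytes_per_elem  # factor 2: K and V


def effective_bandwidth_tbps(batch: int, sk: int, hk: int, d: int,
                             seconds: float, bytes_per_elem: int = 2) -> float:
    """Achieved bandwidth in TB/s for a measured kernel time in seconds."""
    return kv_bytes_per_step(batch, sk, hk, d, bytes_per_elem) / seconds / 1e12


if __name__ == "__main__":
    # Shape from the sweep above at batch=4; the kernel time must be measured.
    print(kv_bytes_per_step(batch=4, sk=120_000, hk=16, d=128))
```

Because sq=1 keeps Q and output traffic negligible next to the KV cache, the K/V term dominates under these assumptions.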
Correctness suite stays at 241/245 (same 4 known int32-overflow
failures in the prefill path).
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 11:48:19 +00:00