composable_kernel/example
Amir Ghamarian d93efe1b61 Add fused topk_softmax_decode kernel for M=1 MoE decode
New CK tile kernel variant that fuses topk_softmax and moe_sorting into
a single kernel launch for the decode case (M=1, single token). The
pipeline inlines the topk loop with results in shared memory (no global
scratch), then thread 0 emits moe_sorting-compatible packed output.
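
As a rough illustration of the data flow described above, here is a minimal CPU reference sketch for the single-token case. It is not the CK tile kernel. The names `DecodeRouting` and `topk_softmax_decode_ref` are hypothetical, and the ordering by expert id is only an assumed stand-in for the moe_sorting-compatible packed output, whose exact layout the commit message does not specify. It also assumes softmax is taken over all E experts before selection, with no renormalization over the selected k.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical result type for illustration only; not the CK packed format.
struct DecodeRouting {
    std::vector<int32_t> expert_ids; // selected experts, ascending by expert id
    std::vector<float> weights;      // softmax probability of each selected expert
};

// CPU sketch of top-k softmax for one token (M=1), followed by an
// expert-id ordering that stands in for the sorting stage (assumption).
DecodeRouting topk_softmax_decode_ref(const std::vector<float>& logits, int k)
{
    const int num_experts = static_cast<int>(logits.size());
    assert(k > 0 && k <= num_experts);

    // Numerically stable softmax over all experts.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(num_experts);
    float sum = 0.0f;
    for (int e = 0; e < num_experts; ++e) {
        probs[e] = std::exp(logits[e] - max_logit);
        sum += probs[e];
    }
    for (float& p : probs) {
        p /= sum;
    }

    // Inlined top-k loop: k passes, each taking the largest remaining
    // probability. Intermediate results stay in local buffers, loosely
    // mirroring the "no global scratch" idea.
    std::vector<bool> taken(num_experts, false);
    std::vector<std::pair<int32_t, float>> picked;
    for (int i = 0; i < k; ++i) {
        int best = -1;
        for (int e = 0; e < num_experts; ++e) {
            if (!taken[e] && (best < 0 || probs[e] > probs[best])) {
                best = e;
            }
        }
        taken[best] = true;
        picked.emplace_back(static_cast<int32_t>(best), probs[best]);
    }

    // With a single token, ordering the k picks by expert id is enough to
    // group work per expert (assumed simplification of the sorting step).
    std::sort(picked.begin(), picked.end(),
              [](const std::pair<int32_t, float>& a,
                 const std::pair<int32_t, float>& b) { return a.first < b.first; });

    DecodeRouting out;
    for (const auto& p : picked) {
        out.expert_ids.push_back(p.first);
        out.weights.push_back(p.second);
    }
    return out;
}
```

Calling `topk_softmax_decode_ref({1.0f, 3.0f, 2.0f, 0.0f}, 2)` selects experts 1 and 2. A reference of this kind is mainly useful for checking a fused kernel's selected experts and weights against a plain implementation.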

Includes CMake target tile_example_topk_softmax_decode with built-in
comparison benchmark against the separate topk+sorting baseline.

Validated on gfx950, E=8..1024, k=1..8, bf16/fp16.

Made-with: Cursor
2026-03-29 18:06:03 +00:00