mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-30 11:47:48 +00:00
Add a CK tile kernel variant that fuses topk_softmax and moe_sorting into a single kernel launch for the decode case (M=1, single token). The pipeline inlines the topk loop and keeps its results in shared memory (no global scratch); thread 0 then emits moe_sorting-compatible packed output.

Includes the CMake target tile_example_topk_softmax_decode, which has a built-in benchmark comparing the fused kernel against the separate topk+sorting baseline.

Validated on gfx950 with E=8..1024, k=1..8, bf16/fp16.

Made-with: Cursor
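For readers unfamiliar with the two fused steps, the following is a minimal host-side sketch of what a single-token (M=1) top-k softmax followed by an expert sort computes. It is not the CK kernel: the names `DecodeRouting` and `topk_softmax_decode_ref` are hypothetical, the tie-breaking rule (lowest expert index wins) is an assumption, the weights are left unnormalized over the selected experts, and the moe_sorting packed output format is deliberately not reproduced.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical reference result for one token: the selected experts and
// their softmax probabilities, ordered by expert id.
struct DecodeRouting {
    std::vector<int> expert_ids;
    std::vector<float> weights;
};

// Hypothetical CPU reference (not the CK implementation).
inline DecodeRouting topk_softmax_decode_ref(const std::vector<float>& logits, int k) {
    const int num_experts = static_cast<int>(logits.size());
    assert(k >= 1 && k <= num_experts);

    // Numerically stable softmax over the token's expert logits.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(num_experts);
    float sum = 0.0f;
    for (int e = 0; e < num_experts; ++e) {
        probs[e] = std::exp(logits[e] - max_logit);
        sum += probs[e];
    }
    for (float& p : probs) p /= sum;

    // Top-k loop: k argmax passes, masking each winner after it is picked.
    // Assumption: ties go to the lowest expert index.
    std::vector<bool> taken(num_experts, false);
    std::vector<std::pair<int, float>> picked;
    for (int i = 0; i < k; ++i) {
        int best = -1;
        for (int e = 0; e < num_experts; ++e) {
            if (!taken[e] && (best < 0 || probs[e] > probs[best])) best = e;
        }
        taken[best] = true;
        picked.emplace_back(best, probs[best]);
    }

    // Sorting step: with a single token, grouping by expert reduces to
    // ordering the k selections by expert id.
    std::sort(picked.begin(), picked.end(),
              [](const std::pair<int, float>& a, const std::pair<int, float>& b) {
                  return a.first < b.first;
              });

    DecodeRouting out;
    for (const auto& p : picked) {
        out.expert_ids.push_back(p.first);
        out.weights.push_back(p.second);
    }
    return out;
}
```

A reference like this could serve as a correctness check for the fused output on small inputs, provided its tie-breaking and normalization are first aligned with the kernel's actual behavior.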