Add a CK Tile kernel variant that fuses topk_softmax and moe_sorting into
a single kernel launch for the decode case (M=1, single token). The
pipeline inlines the top-k loop and keeps its results in shared memory
(no global scratch buffer); thread 0 then emits the packed output in the
moe_sorting-compatible format.
Also adds the CMake target tile_example_topk_softmax_decode, which has a
built-in benchmark comparing the fused kernel against the separate
topk + sorting baseline.
Validated on gfx950 with E=8..1024, k=1..8, bf16/fp16.
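
For reviewers, a minimal host-side sketch of what the fused path computes
for one token: softmax over the expert logits, an inline top-k selection
loop, then ordering the picks by expert id. This is an illustrative
reference only, not the kernel code. The names TopkResult and
topk_softmax_single_token are hypothetical, and the packed
moe_sorting-compatible layout is deliberately not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical host-side reference for the M=1 case; not the CK Tile kernel.
struct TopkResult {
    std::vector<int> expert_ids;  // selected experts, ascending by id
    std::vector<float> weights;   // softmax probability of each selected expert
};

TopkResult topk_softmax_single_token(const std::vector<float>& logits, int k) {
    const int num_experts = static_cast<int>(logits.size());
    assert(num_experts > 0 && k >= 0 && k <= num_experts);

    // Numerically stable softmax over all experts.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(num_experts);
    float sum = 0.0f;
    for (int e = 0; e < num_experts; ++e) {
        probs[e] = std::exp(logits[e] - max_logit);
        sum += probs[e];
    }
    for (float& p : probs) p /= sum;

    // Inline top-k loop: pick the current maximum k times, masking each winner.
    // The picks live in a small local buffer (standing in for shared memory).
    std::vector<std::pair<int, float>> picked;
    std::vector<bool> taken(num_experts, false);
    for (int i = 0; i < k; ++i) {
        int best = -1;
        for (int e = 0; e < num_experts; ++e) {
            if (!taken[e] && (best < 0 || probs[e] > probs[best])) best = e;
        }
        taken[best] = true;
        picked.emplace_back(best, probs[best]);
    }

    // With a single token, the sorting stage reduces to ordering by expert id.
    std::sort(picked.begin(), picked.end());

    TopkResult out;
    for (const auto& p : picked) {
        out.expert_ids.push_back(p.first);
        out.weights.push_back(p.second);
    }
    return out;
}
```

Calling topk_softmax_single_token({0.1f, 2.0f, -1.0f, 3.0f}, 2) selects
experts 1 and 3, with the larger weight on expert 3. The selection loop
is O(E*k), which is acceptable for a sketch at these sizes.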
Made-with: Cursor