mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-14 18:17:44 +00:00
Title: feat(composablekernel): add swiglustep_and_mul activation to gridwise_moe_gemm

Description:

## Motivation

Step-3.5-Flash uses a clamped SwiGLU activation (`swiglu_limits[43]=7`, `swiglu_limits[44]=7`) for layers 43 and 44. Without this kernel path, those layers produce BOS token spam because unclamped gate/up values accumulate floating-point noise over 200+ decode steps, degrading output quality (cosine similarity drops from 0.999989 to ~0.998982).

## Changes

Add `swiglustep_and_mul` as a new `Activation` enum branch in `gridwise_moe_gemm.hpp`, covering all four code paths:

- Quantized (A×B scale) + `IsInputGemm=true`
- Quantized (A×B scale) + `IsInputGemm=false`
- Non-quantized + `IsInputGemm=true`
- Non-quantized + `IsInputGemm=false`

The activation computes:

```
gate   = silu(gate)
gate   = clamp(gate, max=7.0f)
up     = clamp(up, min=-7.0f, max=7.0f)
output = gate * up
```

Also handles the `MulRoutedWeight` case (top-k weight multiplication) and `pk_i4_t` weight scaling (×16 dequant factor).

## Verification

- Tested on gfx950 (MI350X, 8×GPU); cosine similarity for layers 43/44: **0.999989** (vs 0.998982 before the fix)
- End-to-end Step-3.5-Flash inference: no BOS spam, output coherent
- BF16 tp=2/tp=4 and FP8 tp=2/tp=4 all verified PASS
- [x] Looked over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests
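For reference, the clamped-SwiGLU math can be sketched as a standalone host-side functor. This is a minimal illustration, not the fused device code in `gridwise_moe_gemm.hpp`; the functor name `SwigluStepAndMul` and the `limit` member are hypothetical, while the 7.0f limit and the silu/clamp/multiply sequence come from the description above.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical host-side sketch of the swiglustep_and_mul activation:
// apply silu to the gate, clamp the gate to an upper bound, clamp the up
// projection to [-limit, limit], then multiply. limit = 7.0f matches
// swiglu_limits[43] / swiglu_limits[44] in Step-3.5-Flash.
struct SwigluStepAndMul
{
    float limit = 7.0f;

    float operator()(float gate, float up) const
    {
        // silu(x) = x * sigmoid(x)
        gate = gate / (1.0f + std::exp(-gate));
        // one-sided clamp: only the upper bound is limited on the gate
        gate = std::min(gate, limit);
        // two-sided clamp on the up projection
        up = std::clamp(up, -limit, limit);
        return gate * up;
    }
};
```

The clamps are what suppress the floating-point noise accumulation: for large positive inputs silu(x) ≈ x, so without the upper bound the gate grows unchecked, e.g. `SwigluStepAndMul{}(100.0f, 100.0f)` saturates both operands at 7 and returns 49 instead of roughly 10000.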