mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-07-03 21:58:13 +00:00
feat(ck): Add swiglu_oai (OAI SwiGLU) activation to XDL 2-stage MoE epilogue. (#8944)

## Motivation

Enable the OAI-form SwiGLU activation (`swiglu_oai`, `gate * sigmoid(1.702 * gate) * (up + 1)`, gpt-oss style) in the Composable Kernel XDL 2-stage MoE path. The MoE gridwise kernel epilogue currently supports only silu/gelu; this adds swiglu_oai so OAI-style MoE models can use this path.

JIRA ID : ROCM-27213

## Technical Details

- `gridwise_gemm_xdl_cshuffle_common.hpp`: add `Activation::swiglu_oai_and_mul = 3`.
- `gridwise_moe_gemm.hpp`: add the `apply_swiglu_oai_activation` helper (`gate * sigmoid(1.702 * gate) * (up + 1)`, clamp `gate <= 7` and `up in [-7, 7]`, OAI/gpt-oss form) and wire it into all 4 epilogue paths (quant + non-quant x `Run` / `Run_2Lds`).
- The activation is applied in fp32 in the epilogue and is orthogonal to the GEMM compute (MFMA/tile/pipeline untouched) and to quantization (existing per-token dequant reused). Only the non-blockscale gridwise kernel is changed.
- Consumed by aiter via ROCm/aiter#3886 (dispatch + codegen); review/merge together.

## Test Plan

Validate the new epilogue branch against a torch fp32 OAI-SwiGLU reference through the aiter per-token fp8 MoE path (op-isolate on gfx942 / MI308X).

## Test Result

cos_sim = 0.999993 vs the torch fp32 OAI-SwiGLU reference; no NaN. Confirmed the per-token fp8 path dispatches to this `GridwiseMoeGemm` kernel (rocprofv3) and runs the swiglu_oai epilogue branch.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
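
For reference, the per-element math stated under Technical Details (`gate * sigmoid(1.702 * gate) * (up + 1)` with `gate <= 7` and `up in [-7, 7]`, all in fp32) can be sketched on the host as follows. This is an illustrative scalar reference only, not the device helper from `gridwise_moe_gemm.hpp`: the name `swiglu_oai_ref` and the constants `kAlpha` / `kLimit` are hypothetical, and applying the clamps before the sigmoid is an assumption based on the description above.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical host-side reference for the OAI-form SwiGLU described above.
// It is NOT the CK device helper; names and structure are illustrative.
constexpr float kAlpha = 1.702f; // sigmoid scale from the PR description
constexpr float kLimit = 7.0f;   // clamp bound from the PR description

inline float swiglu_oai_ref(float gate, float up)
{
    // Assumed order: clamp first, then activate.
    const float g = std::min(gate, kLimit);                    // gate <= 7
    const float u = std::max(-kLimit, std::min(up, kLimit));   // up in [-7, 7]

    // sigmoid(kAlpha * g), computed in fp32
    const float sig = 1.0f / (1.0f + std::exp(-kAlpha * g));

    return g * sig * (u + 1.0f);
}
```

A scalar reference like this is convenient for spot-checking individual epilogue outputs: inputs beyond the clamp bounds should give the same result as the bounds themselves, and `gate == 0` should always give zero.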