feat(ck): Add swiglu_oai (OAI SwiGLU) activation to XDL
2-stage MoE epilogue. (#8944)
## Motivation
Enable the OAI-form SwiGLU activation (`swiglu_oai`, `gate *
sigmoid(1.702 * gate) * (up + 1)`, gpt-oss style) in the Composable
Kernel XDL 2-stage MoE path. The MoE gridwise kernel epilogue currently
supports only silu/gelu; this adds swiglu_oai so OAI-style MoE models
can use this path.
JIRA ID: ROCM-27213
## Technical Details
- `gridwise_gemm_xdl_cshuffle_common.hpp`: add
`Activation::swiglu_oai_and_mul = 3`.
- `gridwise_moe_gemm.hpp`: add the `apply_swiglu_oai_activation` helper
(`gate * sigmoid(1.702 * gate) * (up + 1)`, OAI/gpt-oss form, with
`gate` clamped to `gate <= 7` and `up` clamped to `[-7, 7]`) and wire it
into all 4 epilogue paths (quantized and non-quantized, each in `Run`
and `Run_2Lds`).
- The activation is applied in fp32 in the epilogue and is orthogonal to
the GEMM compute (MFMA/tile/pipeline untouched) and to quantization
(existing per-token dequant reused). Only the non-blockscale gridwise
kernel is changed.
- Consumed by aiter via ROCm/aiter#3886 (dispatch + codegen); the two
PRs should be reviewed and merged together.
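
For reviewers, the math of the new epilogue branch can be sketched as a host-side scalar reference. This is not the code in `gridwise_moe_gemm.hpp`: the namespace, the function name `swiglu_oai_ref`, and the constant names are hypothetical, and applying the clamps before the sigmoid is an assumption based on the formula and limits listed above.

```cpp
#include <algorithm>
#include <cmath>

namespace sketch {

// Constants taken from the PR description; the names are hypothetical.
constexpr float kAlpha = 1.702f; // sigmoid scale in the OAI form
constexpr float kLimit = 7.0f;   // clamp bound for gate (upper) and up (both sides)

// Scalar fp32 reference for: gate * sigmoid(1.702 * gate) * (up + 1).
// Assumption: clamping happens before the activation is evaluated.
inline float swiglu_oai_ref(float gate, float up)
{
    const float g   = std::min(gate, kLimit);                  // gate <= 7
    const float u   = std::max(-kLimit, std::min(up, kLimit)); // up in [-7, 7]
    const float sig = 1.0f / (1.0f + std::exp(-kAlpha * g));   // sigmoid(1.702 * g)
    return g * sig * (u + 1.0f);
}

} // namespace sketch
```

Because both inputs are clamped, `swiglu_oai_ref(100.0f, 2.0f)` returns the same value as `swiglu_oai_ref(7.0f, 2.0f)`, and any input with `up == -1` yields zero regardless of `gate`.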
## Test Plan
Validate the new epilogue branch against a torch fp32 OAI-SwiGLU
reference through the aiter per-token fp8 MoE path (op-isolate on gfx942
/ MI308X).
## Test Result
cos_sim = 0.999993 vs the torch fp32 OAI-SwiGLU reference; no NaN.
Confirmed the per-token fp8 path dispatches to this `GridwiseMoeGemm`
kernel (rocprofv3) and runs the swiglu_oai epilogue branch.
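
The `cos_sim` figure above is a cosine similarity between the kernel output and the reference output. The actual check was run through aiter against a torch reference; the following is only a minimal C++ sketch of the metric itself, with the hypothetical name `cosine_similarity`, the double-precision accumulation, and the mismatch handling all being assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Cosine similarity of two equally sized fp32 buffers.
// Accumulates in double to limit rounding error on long vectors (an assumption,
// not necessarily what the original test harness does).
inline double cosine_similarity(const std::vector<float>& a, const std::vector<float>& b)
{
    if(a.size() != b.size() || a.empty())
        throw std::invalid_argument("buffers must be non-empty and equally sized");

    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for(std::size_t i = 0; i < a.size(); ++i)
    {
        const double x = a[i];
        const double y = b[i];
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    // A NaN in either buffer propagates into the result, so a finite value
    // close to 1 also implies that no NaN was present.
    return dot / std::sqrt(norm_a * norm_b);
}
```

A value near 1 means the two outputs point in almost the same direction; the metric is insensitive to a uniform scale error, so it is best read alongside an absolute or relative error check.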
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.