composable_kernel/include/ck/tensor_operation
Linjun-AMD f4e6fad973 [rocm-libraries] ROCm/rocm-libraries#8944 (commit 7be2dbb)
feat(ck): Add swiglu_oai (OAI SwiGLU) activation to XDL 2-stage MoE epilogue. (#8944)

## Motivation

Enable the OAI-form SwiGLU activation (`swiglu_oai`, gpt-oss style:
`gate * sigmoid(1.702 * gate) * (up + 1)`) in the Composable Kernel XDL
2-stage MoE path. The MoE gridwise kernel epilogue currently supports
only the silu and gelu activations; this change adds `swiglu_oai` so
that OAI-style MoE models can use this path.

JIRA ID: ROCM-27213

## Technical Details

- `gridwise_gemm_xdl_cshuffle_common.hpp`: add
`Activation::swiglu_oai_and_mul = 3`.
- `gridwise_moe_gemm.hpp`: add the `apply_swiglu_oai_activation` helper,
which computes `gate * sigmoid(1.702 * gate) * (up + 1)` (OAI/gpt-oss
form) with `gate` clamped to at most 7 and `up` clamped to `[-7, 7]`,
and wire it into all 4 epilogue paths (quant and non-quant, each for
`Run` and `Run_2Lds`).
- The activation is applied in fp32 in the epilogue and is orthogonal to
the GEMM compute (MFMA/tile/pipeline untouched) and to quantization
(existing per-token dequant reused). Only the non-blockscale gridwise
kernel is changed.
- Consumed by aiter via ROCm/aiter#3886 (dispatch + codegen);
review/merge together.
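
For reviewers, the activation described above can be sketched as a scalar host-side function. This is a minimal illustration, not the CK device helper: the name `swiglu_oai_reference` is hypothetical, and it assumes the clamps are applied to the inputs before the formula is evaluated, which the description implies but does not spell out.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical host-side reference; NOT the signature of the CK helper
// apply_swiglu_oai_activation, which is not shown in this description.
// Assumption: gate and up are clamped first, then the formula is applied.
inline float swiglu_oai_reference(float gate, float up)
{
    constexpr float kAlpha = 1.702f; // sigmoid scale from the description
    constexpr float kLimit = 7.0f;   // clamp bound from the description

    // gate is bounded from above only; up is bounded on both sides.
    const float g = std::min(gate, kLimit);
    const float u = std::max(-kLimit, std::min(up, kLimit));

    // gate * sigmoid(1.702 * gate) * (up + 1), evaluated in fp32.
    const float sig = 1.0f / (1.0f + std::exp(-kAlpha * g));
    return g * sig * (u + 1.0f);
}
```

Under these assumptions, inputs beyond the clamp bounds give the same result as inputs at the bounds, for example `swiglu_oai_reference(10.0f, 10.0f)` equals `swiglu_oai_reference(7.0f, 7.0f)`, while a large negative `gate` is left unclamped.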

## Test Plan

Validate the new epilogue branch against a torch fp32 OAI-SwiGLU
reference through the aiter per-token fp8 MoE path (op-isolate on gfx942
/ MI308X).

## Test Result

cos_sim = 0.999993 vs the torch fp32 OAI-SwiGLU reference; no NaN.
Confirmed the per-token fp8 path dispatches to this `GridwiseMoeGemm`
kernel (rocprofv3) and runs the swiglu_oai epilogue branch.
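
The two checks reported above (cosine similarity against a reference, and absence of NaN) can be sketched as follows. This is an illustrative sketch only: the helper names `cosine_similarity` and `has_nan` are hypothetical, and the actual validation was run through aiter against a torch reference rather than with this code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: cosine similarity between a kernel output and a
// reference output, accumulated in double to limit rounding error.
inline double cosine_similarity(const std::vector<float>& a,
                                const std::vector<float>& b)
{
    const std::size_t n = std::min(a.size(), b.size());
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for(std::size_t i = 0; i < n; ++i)
    {
        dot += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    // Return 0 for a zero-length or all-zero vector instead of dividing by 0.
    if(norm_a == 0.0 || norm_b == 0.0)
        return 0.0;
    return dot / std::sqrt(norm_a * norm_b);
}

// Hypothetical helper: true if any element of the output is NaN.
inline bool has_nan(const std::vector<float>& v)
{
    return std::any_of(v.begin(), v.end(),
                       [](float x) { return std::isnan(x); });
}
```

A value close to 1 indicates that the output and the reference point in nearly the same direction; it does not by itself bound the per-element error, so it is commonly paired with the NaN check.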

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-07-01 12:36:31 +00:00