Linjun-AMD
f17537b7c2
[CK] add swiglustep_and_mul activation to gridwise_moe_gemm (#6873)
Title:
feat(composablekernel): add swiglustep_and_mul activation to
gridwise_moe_gemm
Description:
## Motivation
Step-3.5-Flash uses a clamped SwiGLU activation (`swiglu_limits[43]=7`,
`swiglu_limits[44]=7`) for layers 43 and 44. Without this kernel path,
those layers emit repeated BOS tokens: unclamped gate/up values
accumulate floating-point noise over 200+ decode steps, and output
quality degrades (cosine similarity drops from 0.999989 to ~0.998982).
## Changes
Add `swiglustep_and_mul` as a new `Activation` enum branch in
`gridwise_moe_gemm.hpp`, covering all 4 code paths:
- Quantized (A×B scale) + IsInputGemm=true
- Quantized (A×B scale) + IsInputGemm=false
- Non-quantized + IsInputGemm=true
- Non-quantized + IsInputGemm=false
The activation computes:
gate = silu(gate)
gate = clamp(gate, max=7.0f)
up = clamp(up, min=-7.0f, max=7.0f)
output = gate * up
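As a rough host-side illustration of the four steps above, the following C++17 sketch applies them to a single gate/up pair. The names `silu_ref` and `swiglustep_and_mul_ref` and the `limit` parameter are hypothetical and are not taken from `gridwise_moe_gemm.hpp`. The sketch assumes the standard SiLU definition `x * sigmoid(x)`.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical scalar reference, not the CK kernel code.
// SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)).
inline float silu_ref(float x) { return x / (1.0f + std::exp(-x)); }

// Clamped SwiGLU on one gate/up pair, following the steps listed above.
// `limit` defaults to 7.0f, the swiglu_limits value quoted for layers 43/44.
inline float swiglustep_and_mul_ref(float gate, float up, float limit = 7.0f)
{
    gate = silu_ref(gate);               // gate = silu(gate)
    gate = std::min(gate, limit);        // upper clamp only
    up   = std::clamp(up, -limit, limit); // symmetric clamp
    return gate * up;                    // output = gate * up
}
```

With the default limit, the product is bounded above in magnitude by 49 whenever the gate saturates, which is the property the clamp is meant to provide.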
The new branch also handles the `MulRoutedWeight` case (topk weight
multiplication) and `pk_i4_t` weight scaling (×16 dequant factor).
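The two scaling factors can be pictured with a small C++17 sketch. The helper `scale_accumulator` and its template flags are hypothetical. Where the kernel applies these factors relative to the activation is not stated here, so the sketch only scales one raw fp32 accumulator value.

```cpp
// Hypothetical helper, not the CK kernel code.
// IsPkI4:          apply the x16 dequant factor used for pk_i4_t weights.
// MulRoutedWeight: multiply by the routed (topk) weight for this token.
template <bool IsPkI4, bool MulRoutedWeight>
inline float scale_accumulator(float acc, float topk_weight)
{
    if constexpr (IsPkI4)
    {
        acc *= 16.0f;
    }
    if constexpr (MulRoutedWeight)
    {
        acc *= topk_weight;
    }
    return acc;
}
```

Using compile-time flags mirrors the idea of selecting a code path per kernel instantiation, so disabled factors cost nothing at run time.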
## Verification
- Tested on gfx950 (MI350X, 8×GPU)
- Cosine similarity for layers 43/44: **0.999989** (vs. 0.998982 before
the fix)
- End-to-end Step-3.5-Flash inference: no BOS spam, coherent output
- BF16 tp=2/tp=4 and FP8 tp=2/tp=4 all pass
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
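For readers who want to reproduce the cosine-similarity check, here is a minimal C++ sketch of the metric. It assumes the kernel output and a reference output are available as flat float vectors. The function name `cosine_similarity` is hypothetical, and the description does not say which tooling produced the numbers above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: cosine similarity between two flat float buffers.
// Accumulates in double to limit rounding error on long vectors.
inline double cosine_similarity(const std::vector<float>& a,
                                const std::vector<float>& b)
{
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        dot    += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0)
    {
        return 0.0; // undefined for zero vectors; return 0 by convention here
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

A value close to 1 means the two outputs point in nearly the same direction.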
2026-05-07 05:59:47 +00:00