* chore(copyright) update library wide CMakeLists.txt files copyright header template * Fix build --------- Co-authored-by: Sami Remes <samremes@amd.com>
MoE-SmoothQuant with CK Tile
This example demonstrates MoE-SmoothQuant, a fused quantization operation for Mixture-of-Experts (MoE) models, using the CK Tile programming model. Unlike standard SmoothQuant, the input scale is expert-dependent, and the operation is fused with top-k expert selection. Specifically, it quantizes the top-k experts' outputs for each token using their respective expert scales. The input scale is from different expert [expert, hidden], and we need reuse the topk-id from previous topk-softmax and select the corresponding expert from current topk, and expand the output/per-token-scale by topk.
This diagram depicts moe-smoothquant using ck_tile tile-programming implementation.

Algorithm and Math
Given:
- Input:
Xof shape[\text{tokens}, \text{topk}, \text{hidden}] - Expert scales:
Sof shape[\text{experts}, \text{hidden}] - TopK indices:
Iof shape[\text{tokens}, \text{topk}]
Steps:
- For each token
tand itskselected experts:- Select scale
S_{I_{t,k}, :}for the $k$-th expert. - Scale:
Y_{t,k,j} = X_{t,k,j} \cdot S_{I_{t,k}, j}
- Select scale
- Rowwise Dynamic Quantization (per token-expert pair):
s_{t,k} = \max_j |Y_{t,k,j}| / 127Q_{t,k,j} = \text{round}(Y_{t,k,j} / s_{t,k}),Q_{t,k,j} \in \text{int8}
Output:
- Quantized tensor
Q(int8) - Per-token-expert scale
s(fp32)
Tile Programming Model
- Tiles: Each thread block processes a tile (block of tokens, experts, or hidden units).
- Tile Engine: Loads input, selects expert scales via top-k indices, applies scaling and quantization, and writes results.
- Pipeline: Modular, can be extended for further fusion.
Build & Run
mkdir build && cd build
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_moe_smoothquant -j`nproc`
./bin/tile_example_moe_smoothquant -?
Source Structure
- Kernel:
moe_smoothquant.hpp(tile-programming kernel template) - Executable:
moe_smoothquant.cpp - Build:
CMakeLists.txt,instances/,misc/,script/
Technical Notes
- Expert-dependent scaling: Each token's top-k experts use their own per-hidden-unit scale, requiring indirect indexing and efficient memory access.
- Fused with top-k: The kernel uses top-k indices from gating to select the correct expert scale for each token.
- Rowwise quantization: Each token-expert pair is quantized independently for maximum accuracy.
Related CK Tile Examples
- 09_topk_softmax: TopK-Softmax for MoE gating
- 13_moe_sorting: MoE sorting for expert dispatch
- 12_smoothquant: Standard SmoothQuant
For distribution, see include/ck_tile/tile_program/tile_distribution/.
example
args:
-t tokens dimension (default:3328)
-h hidden_size dimension (default:4096)
-e experts (default:32)
-k topk (default:5)
-stride stride per row, if -1 then equal to hidden_size (default:-1)
-v cpu validation or not (default:1)
-kname print kernel name or not (default:1)
-prec_i input precision, fp16/bf16 (default:fp16)
-prec_o precision, int8/fp8 (default:int8)
-warmup cold iter (default:5)
-repeat hot iter (default:20)
-json 0: No Json, 1: Dump Results in Json format (default:0)
-jsonfile json file name to dump results (default:moe_smoothquant.json)