From d7475e812587805b9f2b9ef33a4543f53f97b03b Mon Sep 17 00:00:00 2001
From: Luo Cheng
Date: Fri, 24 Apr 2026 06:44:37 +0800
Subject: [PATCH] [CK] Fix out of bounds modifications caused by negative topk_ids in MoeSortingMultiPhaseKernel_P0_v1 (#6242)

## Motivation

Fix random sglang crashes by filtering out negative topk ids.

## Technical Details

In sglang expert-parallel mode, an idle batch (batch=0) may be launched; in CUDA graph mode it reuses the batch=1 resources. The topk op sets unused topk ids to -1, so in the idle-batch case all topk ids are -1. In `MoeSortingMultiPhaseKernel_P0_v1`, the existing check `eid < kargs.num_experts` accepts a negative expert id, which then causes an out-of-bounds write, and sglang may crash randomly.

Besides the idle-batch case, if the captured batch sizes are discrete, similar logic can also produce expert ids of -1.

## Test Plan

## Test Result

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: zovonoir
---
 include/ck_tile/ops/fused_moe/kernel/moe_sorting_kernel.hpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/ck_tile/ops/fused_moe/kernel/moe_sorting_kernel.hpp b/include/ck_tile/ops/fused_moe/kernel/moe_sorting_kernel.hpp
index 07eda483d2..7d766bbe67 100644
--- a/include/ck_tile/ops/fused_moe/kernel/moe_sorting_kernel.hpp
+++ b/include/ck_tile/ops/fused_moe/kernel/moe_sorting_kernel.hpp
@@ -1685,7 +1685,7 @@ struct MoeSortingMultiPhaseKernel_P0_v1
                 IndexType eid = x[j.value]; // ext_vector_type must use int to []
                 uint32_t curr_token_id, curr_topk_id;
                 kargs.topk_mdiv.divmod(i * Problem::SubTokenTile + j, curr_token_id, curr_topk_id);
-                if(eid < kargs.num_experts)
+                if(eid < kargs.num_experts && eid >= 0)
                 {
                     if constexpr(Problem::LocalToken)
                     {
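
Reviewer note: the effect of the added `eid >= 0` guard can be sketched on the host side. The snippet below is a simplified illustration only, not the CK kernel: `count_tokens_per_expert` and its per-expert counter buffer are hypothetical names invented for this example. It assumes expert ids are signed 32-bit integers and that unused slots are padded with -1, as the description states.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side analogue of the guarded update (not the CK kernel).
// Counts how many (token, topk) slots are routed to each expert and skips any
// id outside [0, num_experts), such as the -1 padding written by the topk op.
std::vector<int> count_tokens_per_expert(const std::vector<std::int32_t>& topk_ids,
                                         std::int32_t num_experts)
{
    std::vector<int> counts(static_cast<std::size_t>(num_experts), 0);
    for(std::int32_t eid : topk_ids)
    {
        // With only `eid < num_experts`, eid == -1 would pass the signed
        // comparison and index one element before the start of `counts`.
        if(eid >= 0 && eid < num_experts)
        {
            ++counts[static_cast<std::size_t>(eid)];
        }
    }
    return counts;
}
```

With an all -1 input, which mirrors the idle-batch case, every counter stays at zero and nothing outside the buffer is touched.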