From 2fae12cbbbd86874b18288c93fd76cbbcb657b1c Mon Sep 17 00:00:00 2001
From: Luo Cheng <235219119+luocheng25@users.noreply.github.com>
Date: Thu, 23 Apr 2026 22:45:32 +0000
Subject: [PATCH] [rocm-libraries] ROCm/rocm-libraries#6242 (commit f46ac14)

[CK] Fix out-of-bounds writes caused by negative topk_ids in
MoeSortingMultiPhaseKernel_P0_v1 (#6242)

## Motivation

Fix random sglang crashes by filtering out negative topk ids.

## Technical Details

In sglang expert-parallel mode, an idle batch (batch=0) may be launched. In CUDA graph mode it reuses the batch=1 resources. The topk op sets unused topk ids to -1, so in the idle-batch case all topk ids are -1. In `MoeSortingMultiPhaseKernel_P0_v1`, a negative expert id passes the existing `eid < kargs.num_experts` check and causes an out-of-bounds write, which can make sglang crash randomly.

Apart from the idle-batch case, expert ids of -1 can also appear through similar logic when the captured batch sizes are discrete.

## Test Plan

## Test Result

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: zovonoir
---
 include/ck_tile/ops/fused_moe/kernel/moe_sorting_kernel.hpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/ck_tile/ops/fused_moe/kernel/moe_sorting_kernel.hpp b/include/ck_tile/ops/fused_moe/kernel/moe_sorting_kernel.hpp
index 06ab134f85..61d8ecc42c 100644
--- a/include/ck_tile/ops/fused_moe/kernel/moe_sorting_kernel.hpp
+++ b/include/ck_tile/ops/fused_moe/kernel/moe_sorting_kernel.hpp
@@ -1682,7 +1682,7 @@ struct MoeSortingMultiPhaseKernel_P0_v1
                 IndexType eid = x[j.value]; // ext_vector_type must use int to []
                 uint32_t curr_token_id, curr_topk_id;
                 kargs.topk_mdiv.divmod(i * Problem::SubTokenTile + j, curr_token_id, curr_topk_id);
-                if(eid < kargs.num_experts)
+                if(eid < kargs.num_experts && eid >= 0)
                 {
                     if constexpr(Problem::LocalToken)
                     {