[rocm-libraries] ROCm/rocm-libraries#6242 (commit f46ac14)

[CK] Fix out of bounds modifications caused by negative
 topk_ids in MoeSortingMultiPhaseKernel_P0_v1 (#6242)

## Motivation

Fix random sglang crashes by filtering out negative topk ids.

## Technical Details

In sglang expert-parallel mode, an idle batch (batch=0) may be fired, and
in CUDA graph mode it reuses the batch=1 resources. The topk op sets
unused topk ids to -1, so in the idle-batch case all topk ids are -1. In
`MoeSortingMultiPhaseKernel_P0_v1`, a negative expert id passes the
`eid < kargs.num_experts` check and causes an out-of-bounds write, which
can make sglang crash randomly.
Besides the idle-batch case, if the captured batch sizes are discrete,
expert ids of -1 can also appear through similar logic.
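
The effect of the missing lower-bound check can be illustrated with a minimal host-side sketch. This is not the CK kernel: `is_valid_expert_id` and `count_tokens_per_expert` are hypothetical names used only for illustration, and the per-expert counter stands in for whatever buffer the kernel indexes by `eid`. It assumes a signed index type, for which `-1 < num_experts` is true.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the kernel's index type (assumed signed).
using IndexType = std::int32_t;

// Returns true only for ids that can safely index a per-expert buffer.
// The upper-bound test alone would accept -1, since -1 < num_experts
// holds for a signed type.
inline bool is_valid_expert_id(IndexType eid, IndexType num_experts)
{
    return eid >= 0 && eid < num_experts;
}

// Counts how many topk slots are routed to each expert, skipping
// padding ids such as -1 instead of writing out of bounds.
inline std::vector<IndexType> count_tokens_per_expert(const std::vector<IndexType>& topk_ids,
                                                      IndexType num_experts)
{
    assert(num_experts >= 0);
    std::vector<IndexType> counts(static_cast<std::size_t>(num_experts), 0);
    for(IndexType eid : topk_ids)
    {
        if(is_valid_expert_id(eid, num_experts))
        {
            ++counts[static_cast<std::size_t>(eid)];
        }
    }
    return counts;
}
```

With an idle batch, where every id is -1, all counters stay at zero and nothing outside the buffer is touched.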

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: zovonoir <jialzhu@amd.com>
Luo Cheng
2026-04-23 22:45:32 +00:00
committed by assistant-librarian[bot]
parent 281d1bf50b
commit 2fae12cbbb

@@ -1682,7 +1682,7 @@ struct MoeSortingMultiPhaseKernel_P0_v1
IndexType eid = x[j.value]; // ext_vector_type must use int to []
uint32_t curr_token_id, curr_topk_id;
kargs.topk_mdiv.divmod(i * Problem::SubTokenTile + j, curr_token_id, curr_topk_id);
-if(eid < kargs.num_experts)
+if(eid < kargs.num_experts && eid >= 0)
{
if constexpr(Problem::LocalToken)
{