sglang/python
Benjamin F ebaff7729b fix: regressions (scheduler hang, cuda graph TypeError, MXFP4 cache, rsf double-apply) (#50)
* fix(v4-flash): remove broken MXFP4 weight cache + fix rsf double-apply

Move the routed_scaling_factor application out of
apply_v4_triton_kernels_moe and into its caller (mxfp4_deepseek.apply),
mirroring the trtllm path convention. This fixes a latent double-apply
when SGLANG_OPT_MXFP4_FUSE_RSF_SHARED_ADD is enabled.
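
A minimal sketch of why applying the factor in two places is wrong. All
function names below are hypothetical stand-ins, not the real sglang
kernels, and the assumption that the fused caller-side step is the second
place where the factor gets applied is an illustration only:

```python
# Hypothetical stand-ins; these are NOT the real sglang functions.
RSF = 2.5  # illustrative routed_scaling_factor value


def kernel_old(routed, rsf):
    """Old behaviour: the kernel scales its own output."""
    return routed * rsf


def kernel_new(routed):
    """New behaviour: the kernel returns the unscaled routed output."""
    return routed


def caller_fused_add(routed, shared, rsf):
    """Assumed caller-side step: scale the routed output once, add shared."""
    return routed * rsf + shared


routed, shared = 4.0, 1.0

# Kernel and caller both scale: the factor is applied twice.
double_applied = caller_fused_add(kernel_old(routed, RSF), shared, RSF)

# Only the caller scales: the factor is applied exactly once.
single_applied = caller_fused_add(kernel_new(routed), shared, RSF)
```

Keeping the scaling in exactly one layer (here, the caller) means the
result no longer depends on which fused path is enabled.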

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(scheduler): revert PR #38 req_pool changes that break TP-only mode

PR #38 introduced changes that together cause a scheduler hang on
TP-only configurations with max_running_requests=1:

1. scheduler.py: Removed the `if self.pp_size > 1:` guard in
   get_num_allocatable_reqs, causing TP-only mode to check
   available_size() unconditionally.

2. memory_pool.py: Changed free_slots from `range(size)` to
   `range(1, size)` to reserve index 0. With max_running_requests=1,
   this produces an empty free_slots list.

3. scheduler_runtime_checker_mixin.py: Changed expected_free from
   `req_total_size` to `req_total_size - 1` to match the reserved slot.

This fix reverts all 4 locations to v0.6.1.post1 behavior.
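
The empty free list from item 2 can be reproduced in isolation. This is a
sketch only: `make_free_slots` is a hypothetical helper, and it assumes
the pool size equals max_running_requests:

```python
def make_free_slots(size, reserve_index_zero):
    """Build the free-slot list, optionally reserving index 0.

    Hypothetical helper for illustration; not the memory_pool.py code.
    """
    start = 1 if reserve_index_zero else 0
    return list(range(start, size))


max_running_requests = 1  # assumed to equal the pool size

# Reverted behaviour: range(size) leaves slot 0 available.
reverted_slots = make_free_slots(max_running_requests, reserve_index_zero=False)

# PR #38 behaviour: range(1, size) leaves nothing to allocate.
pr38_slots = make_free_slots(max_running_requests, reserve_index_zero=True)

print(reverted_slots, pr38_slots)
```

With no free slot and an unconditional available_size() check, no request
can ever be admitted, which matches the hang described above.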

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(cuda_graph): use out-of-band _replay_forward_batch for non-DSV4 backends

Cherry-pick fix from upstream 3ffc34dbe to resolve TypeError when
non-DSV4 backends (TritonAttnBackend, etc.) receive unexpected
out_cache_loc kwarg during CUDA graph replay.

Instead of passing out_cache_loc as a parameter (which requires all
backends to update their signatures), use an out-of-band attribute:
- Set attn_backend._replay_forward_batch before the call
- DSV4 backend reads out_cache_loc from this attribute
- Clear the attribute after the call
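
The three steps above can be sketched as follows. The classes and the
`replay` method are hypothetical stand-ins, not the real sglang backends;
only the attribute name `_replay_forward_batch` comes from this commit,
and the try/finally is an assumption about how the clearing step could be
made robust:

```python
class ForwardBatch:
    """Hypothetical stand-in carrying only out_cache_loc."""

    def __init__(self, out_cache_loc):
        self.out_cache_loc = out_cache_loc


class LegacyBackend:
    """Backend whose replay signature does not accept out_cache_loc."""

    def replay(self, bs):
        return ("legacy", bs)


class DSV4LikeBackend:
    """Backend that reads out_cache_loc from the out-of-band attribute."""

    _replay_forward_batch = None

    def replay(self, bs):
        fb = self._replay_forward_batch
        loc = fb.out_cache_loc if fb is not None else None
        return ("dsv4", bs, loc)


def replay_with_side_channel(backend, forward_batch, bs):
    """Set the attribute, call replay with the unchanged signature, clear it."""
    backend._replay_forward_batch = forward_batch
    try:
        return backend.replay(bs)
    finally:
        # Clear even if replay raises, so no stale batch is left behind.
        backend._replay_forward_batch = None


batch = ForwardBatch(out_cache_loc=[3, 4])
legacy_result = replay_with_side_channel(LegacyBackend(), batch, bs=2)
dsv4_backend = DSV4LikeBackend()
dsv4_result = replay_with_side_channel(dsv4_backend, batch, bs=2)
```

Backends that never look at the attribute are unaffected, so their
signatures do not need to change.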

Conflict resolution: kept kt-sglang's attribute path
`self.model_runner.attn_backend` (vs upstream's `self.attn_backend`).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: remove undefined _GraphBucket reference in cuda graph replay

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-14 14:00:27 +08:00