Mirror of https://github.com/kvcache-ai/sglang.git (synced 2026-06-30 03:37:51 +00:00)

During CUDA graph capture, whether regular or piecewise (PCG), torch.cuda.synchronize() and CPU-GPU expert coordination are not allowed. Detect capture mode via is_in_piecewise_cuda_graph() and torch.cuda.is_current_stream_capturing(), and in either case delegate directly to the GPU method.

This makes it possible to run Qwen3.5 with --attention-backend triton without passing --disable-cuda-graph, which improves decode throughput from ~11 tok/s to ~65 tok/s.
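
The capture-mode dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the code from this change: the class `HybridExpertRunner` and its methods `forward_gpu` and `forward_hybrid` are hypothetical names, and the capture checks are passed in as plain callables so the sketch runs without a GPU. In a real setup the checks would presumably be `torch.cuda.is_current_stream_capturing` and the project's `is_in_piecewise_cuda_graph`.

```python
from typing import Any, Callable, Iterable, Tuple


class HybridExpertRunner:
    """Hypothetical wrapper illustrating the dispatch; not SGLang's actual class."""

    def __init__(self, capture_checks: Iterable[Callable[[], bool]]):
        # Each check returns True while some form of CUDA graph capture is active.
        # Assumed real-world checks: torch.cuda.is_current_stream_capturing and
        # is_in_piecewise_cuda_graph.
        self.capture_checks = list(capture_checks)

    def in_graph_capture(self) -> bool:
        # Capture is active if any of the checks reports it.
        return any(check() for check in self.capture_checks)

    def forward(self, x: Any) -> Tuple[str, Any]:
        if self.in_graph_capture():
            # Synchronization and CPU-GPU coordination are not allowed during
            # capture, so go straight to the GPU-only path.
            return self.forward_gpu(x)
        return self.forward_hybrid(x)

    def forward_gpu(self, x: Any) -> Tuple[str, Any]:
        # Placeholder for the GPU-only expert computation.
        return ("gpu", x)

    def forward_hybrid(self, x: Any) -> Tuple[str, Any]:
        # Placeholder for the path that would synchronize and coordinate
        # CPU and GPU experts outside of graph capture.
        return ("hybrid", x)


# Stubbed checks stand in for the real capture predicates.
capturing = HybridExpertRunner([lambda: False, lambda: True])
eager = HybridExpertRunner([lambda: False, lambda: False])
print(capturing.forward(1)[0], eager.forward(1)[0])
```

Checking both predicates matters because either kind of capture, regular or piecewise, forbids the synchronizing path.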