7920 Commits

Author SHA1 Message Date
R0CKSTAR
87e50f20f6 [Apple Silicon][MLX] Cache seq_lens-derived tensors in BatchedDecodeContext (#23470)
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
2026-04-23 18:12:26 -07:00
Mick
c0166355ae [diffusion] CI: minor refactor CI (#23576) 2026-04-24 08:48:31 +08:00
Cheng Wan
d9c72bdd2b Skip unselected experts in flashinfer_trtllm (#23493) 2026-04-23 17:30:19 -07:00
Cheng Wan
000a2525e1 Move expert_mask_gpu from FusedMoE layer to StandardDispatcher (#23585)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 17:17:27 -07:00
Lianmin Zheng
95d021b523 Pre-set SWA cache location in CudaGraphRunner (#23552) 2026-04-23 16:51:29 -07:00
Lianmin Zheng
bb962b0046 Fix MoE no_combine: skip router weight in down projection (#23545) 2026-04-23 16:47:58 -07:00
Sundara Raman Ramachandran
cf88fdcc9c Expose child process PIDs from Engine for health check support (#23320) 2026-04-23 16:44:49 -07:00
sglang-bot
f3b88e080a chore: bump flashinfer version to 0.6.8.post1 (#23281)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2026-04-23 15:23:03 -07:00
Byron Hsu
17210350fd [PD+DP] Allow PrefillDelayer in disaggregated-prefill mode (#23588) 2026-04-23 14:51:16 -07:00
Alex Nails
579bd0b152 [bug fix] has_fp8_weights_in_checkpoint: handle HF repo IDs, not just local paths (#23542)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 12:56:49 -07:00
WangHao-hw
80125febb1 [BUGFIX]Fix Ascend backend pre-allocated range in NPU Graph Mode. (#22778) 2026-04-24 01:23:35 +08:00
Jinghong Li
c6872fc8fb Fix: fallback to torch API when NVML memory query is not supported (#23426)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-04-23 19:26:04 +03:00
Jie Hao
86ed0680d7 feat: add OpenTelemetry tracing to DiffGenerator (#21254) 2026-04-23 09:25:23 -07:00
Arseniy Mironov
76e4c5a1f8 [Diffusion][NPU][Bugfix] Ascend_fa crashes when sequence parallelism is used. (#23572)
Co-authored-by: Napkin-AI <arseniy.mironov.dev@gmail.com>
2026-04-23 19:21:30 +03:00
Baichuan
54e21bb3a5 [fix] Fix dynamic chunking profiling crash on GLM-5 models (#23060)
Co-authored-by: liubaichuan <liubaichuan@infini-ai.com>
2026-04-23 19:30:57 +08:00
Xinyi Song
cd459af4e2 [AMD] Use bpreshuffle FP8 blockscale GEMM to replace ABScale GEMM (#23319)
Co-authored-by: HaiShaw <hixiao@gmail.com>
2026-04-23 01:51:30 -07:00
Ethan (Yusheng) Su
2ef1a21d5e [bug fix] fix: detect FP8 weights from safetensors header instead of ass… (#23414) 2026-04-23 14:49:57 +08:00
Kangyan-Zhou
f1a70b4666 [Observability] Add HTTP sidecar endpoints and FlushCache gRPC RPC for gRPC mode (#22500)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-22 23:06:10 -07:00
mispa-ms
3c5b1f0810 [diffusion] fix: fix --warmup-resolutions hang with --enable-cfg-parallel (#23198) 2026-04-23 13:39:20 +08:00
Kangyan-Zhou
18359aadc8 [CI] Lower GSM8K baselines for B200 nightly after eval unification (#22136)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-22 22:30:54 -07:00
HuangJi
9716599383 [diffusion] fix: avoid illegal memory access in qwen image (#22953) 2026-04-23 12:41:26 +08:00
Mick
4d3c7e781a [diffusion] CI: do not retry consistency failures (#23517) 2026-04-23 12:39:37 +08:00
maocheng23
d3aa9128be Change SGLANG_SIMULATE_ACC_METHOD to 'match-expected' (#23527) 2026-04-22 21:26:08 -07:00
Jimmy Shong
68a8ed9b11 [Fix/Kernel] Add JIT rmsnorm_hf kernel to fix transformers backend MMLU accuracy regression (#22931)
Co-authored-by: SGLang CI <ci@sglang.ai>
2026-04-23 12:00:31 +08:00
Liangsheng Yin
0f21fe924a fix ngram greedy verify kwarg (#23521) 2026-04-22 20:49:54 -07:00
ori
887d380ace [MUSA] Resolve output garbage in Context Parallel on MusaFlashAttentionBackend (#23270)
Co-authored-by: zhiguo.qin <zhiguo.qin@mthreads.com>
2026-04-22 20:22:20 -07:00
Liangsheng Yin
f611dd24f1 fix retrive -> retrieve typo (#23503)
Co-authored-by: SoluMilken <19161836+solumilken@users.noreply.github.com>
2026-04-22 16:35:04 -07:00
Yanbin Jiang
917d2aa1dc [LoRA] Fix EP + per-expert MoE LoRA illegal memory access (#23178) 2026-04-22 14:22:32 -07:00
Sam Shleifer
b9e33d6a5b Dual MoE CUDA graph capture for lora/nolora batches (#22809) 2026-04-22 14:11:11 -07:00
jianan-gu
ad0fc88810 [CPU] [Quantization] Add GPTQ/AWQ 4bits quantization support for CPU (#22685)
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-04-22 13:34:02 -07:00
Byron Hsu
0b77284587 [minor] Make DEFAULT_FORCE_STREAM_INTERVAL configurable via SGLANG_FORCE_STREAM_INTERVAL (#23215) 2026-04-22 13:05:40 -07:00
JasonHe-WQ
f85e3140bf Fix:fix(timeout): fix timeout not propagated (#21944) 2026-04-22 12:48:48 -07:00
Yuxuan Zhang
28cfd3d272 Support defer_loading field at function level for Chat Completions API (#22702)
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2026-04-22 10:09:54 -07:00
Todobe
92f28e9ba8 [NPU]Fix GLM-4.7-Flash failed on NPU (#22509) 2026-04-23 01:06:58 +08:00
cctry
0addd185af Fix /generate endpoint crash when sampling params contain null values (#23401) 2026-04-22 09:56:10 -07:00
Aleksi Vesanto
ac351c1f04 [diffusion] [AMD] model: allow AITER backends in Flux 2 pipeline (#22802) 2026-04-22 08:15:44 -07:00
Shenxiu Liu
8b78e0888c Skip mamba_pool_idx revert for session requests in _get_new_batch_prefill_raw (#23327) 2026-04-22 22:28:06 +08:00
Mick
4323fce82a fix: dot-boundary match in is_layer_skipped for FP8 modules_to_not_convert (#23467) 2026-04-22 22:16:22 +08:00
Shangming Cai
1c06a3d072 [CI] Move disaggregation basic CI back to 2-gpu suite (#23447) 2026-04-22 17:50:33 +08:00
Ming Yang
7b10f01d1c [model_runner] Label forward steps in profile traces with mode and token counts (#23419) 2026-04-22 02:31:18 -07:00
inkcherry
1e34cd0ba5 PD streaming: batch notify + SSE fast path (#22658) 2026-04-22 02:21:02 -07:00
Fengyuan Yu
5c245d978f [Diffusion] Add mixed-resolution benchmark support (for #20762) (#20863)
Signed-off-by: Fengyuan Yu <15fengyuan@gmail.com>
Co-authored-by: Fengyuan Yu <15fengyuan@gmail.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-04-22 09:22:19 +03:00
cctry
e39f0f4ff3 Use libdevice tanh and support 2D-strided tensors in fused softcap kernel (#23157)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-21 22:54:37 -07:00
Wenxuan Tan
c3ea2d7b92 Rename mixed_with_decode_tokens in mixed chunk prefill adder (#6506)
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2026-04-21 22:48:34 -07:00
Tarushii Goel
7607e4d180 py-spy without --native for ARM devices (#23410) 2026-04-21 20:45:52 -07:00
shuwenn
4befc31408 fix: pass v_head_dim to MHA KV pools and validate MiMo HiCache geometry (#23173)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 19:48:45 -07:00
MARATRIX
bf5e71dcec [MUSA][19/N] Support HiCache with pin_memory allocator (#23361)
Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
2026-04-21 19:45:53 -07:00
Piotr Mazurek
6cf0b004ca [MoE] Add LFM2 MoE tuning support + tuned configs for H100/B200/MI325X (#22791)
Co-authored-by: Piotr Mazurek <piotr.mazurek@liquid.ai>
2026-04-21 18:32:05 -07:00
Byron Hsu
c090f71bf2 feat: enable SGLANG_PATCH_TOKENIZER by default (#23409) 2026-04-21 17:53:43 -07:00
hlu1
415f64e763 Add MambaPool kvcache offloading during retraction (#22493) 2026-04-22 08:51:03 +08:00