sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-02 21:37:11 +00:00

Author	SHA1	Message	Date
R0CKSTAR	87e50f20f6	[Apple Silicon][MLX] Cache seq_lens-derived tensors in BatchedDecodeContext (#23470 ) Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>	2026-04-23 18:12:26 -07:00
Mick	c0166355ae	[diffusion] CI: minor refactor CI (#23576 )	2026-04-24 08:48:31 +08:00
Cheng Wan	d9c72bdd2b	Skip unselected experts in flashinfer_trtllm (#23493 )	2026-04-23 17:30:19 -07:00
Cheng Wan	000a2525e1	Move expert_mask_gpu from FusedMoE layer to StandardDispatcher (#23585 ) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 17:17:27 -07:00
Lianmin Zheng	95d021b523	Pre-set SWA cache location in CudaGraphRunner (#23552 )	2026-04-23 16:51:29 -07:00
Lianmin Zheng	bb962b0046	Fix MoE no_combine: skip router weight in down projection (#23545 )	2026-04-23 16:47:58 -07:00
Sundara Raman Ramachandran	cf88fdcc9c	Expose child process PIDs from Engine for health check support (#23320 )	2026-04-23 16:44:49 -07:00
sglang-bot	f3b88e080a	chore: bump flashinfer version to 0.6.8.post1 (#23281 ) Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>	2026-04-23 15:23:03 -07:00
Byron Hsu	17210350fd	[PD+DP] Allow PrefillDelayer in disaggregated-prefill mode (#23588 )	2026-04-23 14:51:16 -07:00
Alex Nails	579bd0b152	[bug fix] has_fp8_weights_in_checkpoint: handle HF repo IDs, not just local paths (#23542 ) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 12:56:49 -07:00
WangHao-hw	80125febb1	[BUGFIX]Fix Ascend backend pre-allocated range in NPU Graph Mode. (#22778 )	2026-04-24 01:23:35 +08:00
Jinghong Li	c6872fc8fb	Fix: fallback to torch API when NVML memory query is not supported (#23426 ) Co-authored-by: ronnie_zheng <zl19940307@163.com>	2026-04-23 19:26:04 +03:00
Jie Hao	86ed0680d7	feat: add OpenTelemetry tracing to DiffGenerator (#21254 )	2026-04-23 09:25:23 -07:00
Arseniy Mironov	76e4c5a1f8	[Diffusion][NPU][Bugfix] Ascend_fa crashes when sequence parallelism is used. (#23572 ) Co-authored-by: Napkin-AI <arseniy.mironov.dev@gmail.com>	2026-04-23 19:21:30 +03:00
Baichuan	54e21bb3a5	[fix] Fix dynamic chunking profiling crash on GLM-5 models (#23060 ) Co-authored-by: liubaichuan <liubaichuan@infini-ai.com>	2026-04-23 19:30:57 +08:00
Xinyi Song	cd459af4e2	[AMD] Use bpreshuffle FP8 blockscale GEMM to replace ABScale GEMM (#23319 ) Co-authored-by: HaiShaw <hixiao@gmail.com>	2026-04-23 01:51:30 -07:00
Ethan (Yusheng) Su	2ef1a21d5e	[bug fix] fix: detect FP8 weights from safetensors header instead of ass… (#23414 )	2026-04-23 14:49:57 +08:00
Kangyan-Zhou	f1a70b4666	[Observability] Add HTTP sidecar endpoints and FlushCache gRPC RPC for gRPC mode (#22500 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-22 23:06:10 -07:00
mispa-ms	3c5b1f0810	[diffusion] fix: fix --warmup-resolutions hang with --enable-cfg-parallel (#23198 )	2026-04-23 13:39:20 +08:00
Kangyan-Zhou	18359aadc8	[CI] Lower GSM8K baselines for B200 nightly after eval unification (#22136 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-22 22:30:54 -07:00
HuangJi	9716599383	[diffusion] fix: avoid illegal memory access in qwen image (#22953 )	2026-04-23 12:41:26 +08:00
Mick	4d3c7e781a	[diffusion] CI: do not retry consistency failures (#23517 )	2026-04-23 12:39:37 +08:00
maocheng23	d3aa9128be	Change SGLANG_SIMULATE_ACC_METHOD to 'match-expected' (#23527 )	2026-04-22 21:26:08 -07:00
Jimmy Shong	68a8ed9b11	[Fix/Kernel] Add JIT rmsnorm_hf kernel to fix transformers backend MMLU accuracy regression (#22931 ) Co-authored-by: SGLang CI <ci@sglang.ai>	2026-04-23 12:00:31 +08:00
Liangsheng Yin	0f21fe924a	fix ngram greedy verify kwarg (#23521 )	2026-04-22 20:49:54 -07:00
ori	887d380ace	[MUSA] Resolve output garbage in Context Parallel on MusaFlashAttentionBackend (#23270 ) Co-authored-by: zhiguo.qin <zhiguo.qin@mthreads.com>	2026-04-22 20:22:20 -07:00
Liangsheng Yin	f611dd24f1	fix retrive -> retrieve typo (#23503 ) Co-authored-by: SoluMilken <19161836+solumilken@users.noreply.github.com>	2026-04-22 16:35:04 -07:00
Yanbin Jiang	917d2aa1dc	[LoRA] Fix EP + per-expert MoE LoRA illegal memory access (#23178 )	2026-04-22 14:22:32 -07:00
Sam Shleifer	b9e33d6a5b	Dual MoE CUDA graph capture for lora/nolora batches (#22809 )	2026-04-22 14:11:11 -07:00
jianan-gu	ad0fc88810	[CPU] [Quantization] Add GPTQ/AWQ 4bits quantization support for CPU (#22685 ) Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>	2026-04-22 13:34:02 -07:00
Byron Hsu	0b77284587	[minor] Make DEFAULT_FORCE_STREAM_INTERVAL configurable via SGLANG_FORCE_STREAM_INTERVAL (#23215 )	2026-04-22 13:05:40 -07:00
JasonHe-WQ	f85e3140bf	Fix:fix(timeout): fix timeout not propagated (#21944 )	2026-04-22 12:48:48 -07:00
Yuxuan Zhang	28cfd3d272	Support defer_loading field at function level for Chat Completions API (#22702 ) Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>	2026-04-22 10:09:54 -07:00
Todobe	92f28e9ba8	[NPU]Fix GLM-4.7-Flash failed on NPU (#22509 )	2026-04-23 01:06:58 +08:00
cctry	0addd185af	Fix /generate endpoint crash when sampling params contain null values (#23401 )	2026-04-22 09:56:10 -07:00
Aleksi Vesanto	ac351c1f04	[diffusion] [AMD] model: allow AITER backends in Flux 2 pipeline (#22802 )	2026-04-22 08:15:44 -07:00
Shenxiu Liu	8b78e0888c	Skip mamba_pool_idx revert for session requests in _get_new_batch_prefill_raw (#23327 )	2026-04-22 22:28:06 +08:00
Mick	4323fce82a	fix: dot-boundary match in is_layer_skipped for FP8 modules_to_not_convert (#23467 )	2026-04-22 22:16:22 +08:00
Shangming Cai	1c06a3d072	[CI] Move disaggregation basic CI back to 2-gpu suite (#23447 )	2026-04-22 17:50:33 +08:00
Ming Yang	7b10f01d1c	[model_runner] Label forward steps in profile traces with mode and token counts (#23419 )	2026-04-22 02:31:18 -07:00
inkcherry	1e34cd0ba5	PD streaming: batch notify + SSE fast path (#22658 )	2026-04-22 02:21:02 -07:00
Fengyuan Yu	5c245d978f	[Diffusion] Add mixed-resolution benchmark support (for #20762 ) (#20863 ) Signed-off-by: Fengyuan Yu <15fengyuan@gmail.com> Co-authored-by: Fengyuan Yu <15fengyuan@gmail.com> Co-authored-by: ronnie_zheng <zl19940307@163.com>	2026-04-22 09:22:19 +03:00
cctry	e39f0f4ff3	Use libdevice tanh and support 2D-strided tensors in fused softcap kernel (#23157 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-04-21 22:54:37 -07:00
Wenxuan Tan	c3ea2d7b92	Rename mixed_with_decode_tokens in mixed chunk prefill adder (#6506 ) Co-authored-by: hnyls2002 <lsyincs@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>	2026-04-21 22:48:34 -07:00
Tarushii Goel	7607e4d180	py-spy without `--native` for ARM devices (#23410 )	2026-04-21 20:45:52 -07:00
shuwenn	4befc31408	fix: pass v_head_dim to MHA KV pools and validate MiMo HiCache geometry (#23173 ) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-21 19:48:45 -07:00
MARATRIX	bf5e71dcec	[MUSA][19/N] Support HiCache with pin_memory allocator (#23361 ) Signed-off-by: yafeng.li <yafeng.li@mthreads.com>	2026-04-21 19:45:53 -07:00
Piotr Mazurek	6cf0b004ca	[MoE] Add LFM2 MoE tuning support + tuned configs for H100/B200/MI325X (#22791 ) Co-authored-by: Piotr Mazurek <piotr.mazurek@liquid.ai>	2026-04-21 18:32:05 -07:00
Byron Hsu	c090f71bf2	feat: enable SGLANG_PATCH_TOKENIZER by default (#23409 )	2026-04-21 17:53:43 -07:00
hlu1	415f64e763	Add MambaPool kvcache offloading during retraction (#22493 )	2026-04-22 08:51:03 +08:00

7920 Commits