sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 12:17:09 +00:00

Author	SHA1	Message	Date
Elfie Guo	a1d5bc4cce	Avoid using flashinfer_allreduce_fusion when dp attention is enabled. (#11632)	2025-10-26 12:31:14 -07:00
Zijian Zhang	a8023891f6	model: support NVILA and NVILA Lite (#10399)	2025-10-26 09:58:09 -07:00
fzyzcjy	0103f374ba	Support DeepGEMM for deterministic inference (#12142)	2025-10-26 22:36:17 +08:00
zyksir	96a5a949f6	[Fix] fix allreduce bug in Piecewise Graph (#12106)	2025-10-26 21:15:48 +08:00
Liangsheng Yin	ea385ae85a	Fix ITL metrics when using openai endpoint with spec (#12156)	2025-10-26 18:06:25 +08:00
Kai-Hsun Chen	6371f7af27	[quantization] AWQ Marlin doesn't work when dtype is bfloat16 (#11494) Signed-off-by: Kai-Hsun Chen <khchen@x.ai> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>	2025-10-26 15:49:45 +08:00
Liangsheng Yin	8491c794ad	[misc] depdencies & enviroment flag (#12113)	2025-10-26 14:52:35 +08:00
Liangsheng Yin	bda3758fac	[log] Make forward iter count optional (#12116)	2025-10-26 14:51:07 +08:00
Lianmin Zheng	7b36c47b3b	Clean up attention backend selection code & Other minor rename (#12136)	2025-10-25 23:50:12 -07:00
Kaixi Hou	ff60406429	[NVIDIA] Change default quant method for model_opt (#11991)	2025-10-25 22:04:57 -07:00
fzyzcjy	c001deba37	Make bmm batch invariant injection optional (#12118)	2025-10-26 10:18:35 +08:00
Lianmin Zheng	8e70064c37	Clean up server launch code and multi tokenizer (#12132)	2025-10-25 16:40:27 -07:00
Baizhou Zhang	4b0ac1d52a	Update sgl-kernel version to 0.3.16.post4 (#12125)	2025-10-25 14:33:33 -07:00
YAMY	c8492978a1	Fix Illegal Instruction/IMA errors when using DP attention -- num_tokens_for_logprob calculation (#12115)	2025-10-25 12:28:26 -07:00
Lianmin Zheng	4caca1ba04	Clean up server args & Add CI scripts (#12124)	2025-10-25 11:53:57 -07:00
Lianmin Zheng	ea13cb1452	[Auto Sync] Update test_deterministic.py, test_deterministi... (20251024) (#12083) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Stefan He <hebiaobuaa@gmail.com>	2025-10-24 22:28:01 -07:00
vipwangerxiao	8982418957	Fix 'KeyError' for per_token expert distribution recorder (#9501) Signed-off-by: Peng Wang <rocking@linux.alibaba.com> Co-authored-by: Peng Wang <rocking@linux.alibaba.com>	2025-10-25 03:28:50 +00:00
fzyzcjy	20bd2271e2	Support true on-policy (#12058)	2025-10-25 10:23:42 +08:00
Cheng Wan	649949807f	[10/N] MoE Refactor: reorganize deepgemm runner in DeepEPMoE (#12054)	2025-10-24 19:16:17 -07:00
fzyzcjy	d7056c5236	Enhance tests in deterministic kernels (#12070)	2025-10-25 08:53:22 +08:00
Jinwu	13bf565d60	[2/N]Support DeepSeek-R1 w4a8 low latency deepep (#8464) Co-authored-by: Hank Han <hanhan7630@outlook.com> Co-authored-by: Shangchuan Huang <2510421000@qq.com>	2025-10-24 17:41:16 -07:00
yinghui	e51046beaa	perf: trtllm_mla attention backend spec decoding speedup w/ cuda graph (#12093)	2025-10-24 16:05:44 -07:00
Minglei Zhu	f4b78d137c	[1/2] deepseek deterministic: support deterministic inference for deepseek arch models on a single GPU (#12000)	2025-10-24 15:17:28 -07:00
Jonah Bernard	4b046a72d3	docs(server-arguments): add allowed options for each argument (#11560)	2025-10-24 11:49:20 -07:00
ishandhanani	14203432b4	fix(compile_utils, ep_moe): update environment variable and dtype check (#12034)	2025-10-24 11:00:12 -07:00
Yuanhang Sun	0bfa394aff	[Fix]: HiCache hasher failed when EAGLE mode enabled (#12025)	2025-10-24 23:53:13 +08:00
fzyzcjy	e04340bf48	Fix multi processing serializer bug (#11958)	2025-10-24 22:53:45 +08:00
Xiaoyu Zhang	8470133852	[b200] fix piecewise cuda graph launch bug (#12067)	2025-10-24 22:36:39 +08:00
Muqi Li	93ef9a094d	[Profiler] expand '~' for `torch_profiler_output_dir` (#11999)	2025-10-24 17:20:46 +08:00
Muqi Li	b04cd3d487	Add 'gguf' to project dependencies (#12046)	2025-10-24 17:16:19 +08:00
Yuan Luo	7ef5d8afd4	Revise POINTSV15Chat model (#12049) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-10-24 17:09:45 +08:00
Qiaolin Yu	71d41212e4	Fix dpsk-r1-fp4 launching crash (#12063)	2025-10-24 17:04:50 +08:00
Xinyuan Tong	b9fb74f3bc	fix: bench_serving ITL calculation when using spec-decoding (#12064)	2025-10-24 17:02:44 +08:00
ybyang	e15b63a182	[Fix] fix missing `ipc_name` of `__getitem__` in some IO structs (#12053) Signed-off-by: ybyang <ybyang7@iflytek.com>	2025-10-24 16:59:14 +08:00
Yuxuan Zhang	4060ed37cb	Refactoring GLM-4.5 and GLM-4.5V related implementations (#11800)	2025-10-24 08:22:36 +00:00
fzyzcjy	2342605ef0	Tiny cleanup send_single (#12056)	2025-10-23 23:53:42 -07:00
Rain Jiang	8e797a47f0	fix: the hardcode hf repo name comparison for deepseek-ocr (#12031) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-10-23 21:37:56 -07:00
Zaili Wang	aa3003f116	Add gguf dependency for cpu/xpu (#12041)	2025-10-23 21:13:17 -07:00
Yongfei Xu	4793ec7d1a	Opt MHA chunked prefix: merge prefix and extend kv cache to run mha once (#10953)	2025-10-23 20:58:10 -07:00
Zaili Wang	92009bd28e	fix: fix MMMU loading issue (#11759) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-10-23 20:21:38 -07:00
Baizhou Zhang	4ef981e2b6	Revert "[Fix] Fix lint to pass CI" (#12042)	2025-10-23 19:44:58 -07:00
Baizhou Zhang	69ed8b67a8	[Fix] Fix lint to pass CI (#12037)	2025-10-23 19:39:38 -07:00
narutolhy	1801cd199f	support more model in piecewise cuda graph (#11745)	2025-10-24 10:31:39 +08:00
Lianmin Zheng	ffc722a690	Revert "lang: support direct video inference" (#12038)	2025-10-23 19:21:31 -07:00
thelongestusernameofall	49afb3d9d9	Fix(security): block unsafe pickle deserialization to mitigate CVE-2025-10164 (#11909) Co-authored-by: Chengxing Xie <xiechengxing34@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-10-23 19:12:40 -07:00
b8zhong	f80371ff8c	Use flashinfer_trtllm moe runner backend to gain around 10% perf on b200 fp8 dpsk (#11816)	2025-10-23 19:12:15 -07:00
Jonah Bernard	62eff37ba1	Refactor Triton-kernel MoE runner integration (#11795)	2025-10-23 18:47:28 -07:00
b8zhong	47e12e082e	Enable Llama 4 + TRTLLM MHA (#12003)	2025-10-23 18:22:58 -07:00
Mick	823b442945	lang: support direct video inference (#9936) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>	2025-10-23 18:12:39 -07:00
Fan Yin	14a4d80e57	[8/n] decouple quantization impl from vllm dependency - gguf srt (#11964) Co-authored-by: Peng Zhang <zhuangsen.zp@antgroup.com>	2025-10-23 18:12:00 -07:00

4180 Commits