sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 20:27:57 +00:00

Author	SHA1	Message	Date
Yuan Luo	7ef5d8afd4	Revise POINTSV15Chat model (#12049 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-10-24 17:09:45 +08:00
Qiaolin Yu	71d41212e4	Fix dpsk-r1-fp4 launching crash (#12063 )	2025-10-24 17:04:50 +08:00
Xinyuan Tong	b9fb74f3bc	fix: bench_serving ITL calculation when using spec-decoding (#12064 )	2025-10-24 17:02:44 +08:00
ybyang	e15b63a182	[Fix] fix missing `ipc_name` of `__getitem__` in some IO structs (#12053 ) Signed-off-by: ybyang <ybyang7@iflytek.com>	2025-10-24 16:59:14 +08:00
Yuxuan Zhang	4060ed37cb	Refactoring GLM-4.5 and GLM-4.5V related implementations (#11800 )	2025-10-24 08:22:36 +00:00
fzyzcjy	2342605ef0	Tiny cleanup send_single (#12056 )	2025-10-23 23:53:42 -07:00
Rain Jiang	8e797a47f0	fix: the hardcode hf repo name comparison for deepseek-ocr (#12031 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-10-23 21:37:56 -07:00
Zaili Wang	aa3003f116	Add gguf dependency for cpu/xpu (#12041 )	2025-10-23 21:13:17 -07:00
Yongfei Xu	4793ec7d1a	Opt MHA chunked prefix: merge prefix and extend kv cache to run mha once (#10953 )	2025-10-23 20:58:10 -07:00
Zaili Wang	92009bd28e	fix: fix MMMU loading issue (#11759 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-10-23 20:21:38 -07:00
Baizhou Zhang	4ef981e2b6	Revert "[Fix] Fix lint to pass CI" (#12042 )	2025-10-23 19:44:58 -07:00
Baizhou Zhang	69ed8b67a8	[Fix] Fix lint to pass CI (#12037 )	2025-10-23 19:39:38 -07:00
narutolhy	1801cd199f	support more model in piecewise cuda graph (#11745 )	2025-10-24 10:31:39 +08:00
Lianmin Zheng	ffc722a690	Revert "lang: support direct video inference" (#12038 )	2025-10-23 19:21:31 -07:00
thelongestusernameofall	49afb3d9d9	Fix(security): block unsafe pickle deserialization to mitigate CVE-2025-10164 (#11909 ) Co-authored-by: Chengxing Xie <xiechengxing34@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-10-23 19:12:40 -07:00
b8zhong	f80371ff8c	Use flashinfer_trtllm moe runner backend to gain around 10% perf on b200 fp8 dpsk (#11816 )	2025-10-23 19:12:15 -07:00
Jonah Bernard	62eff37ba1	Refactor Triton-kernel MoE runner integration (#11795 )	2025-10-23 18:47:28 -07:00
b8zhong	47e12e082e	Enable Llama 4 + TRTLLM MHA (#12003 )	2025-10-23 18:22:58 -07:00
Mick	823b442945	lang: support direct video inference (#9936 ) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>	2025-10-23 18:12:39 -07:00
Fan Yin	14a4d80e57	[8/n] decouple quantization impl from vllm dependency - gguf srt (#11964 ) Co-authored-by: Peng Zhang <zhuangsen.zp@antgroup.com>	2025-10-23 18:12:00 -07:00
sglang-bot	1053e1be17	chore: bump SGLang version to 0.5.4 (#12027 ) Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>	2025-10-23 18:01:40 -07:00
Roger Young	dbd9435dc1	Fix mamba radix cache eviction logic in `alloc_req_slots` (#11616 ) Signed-off-by: rogeryoungh <rogeryoungh@foxmail.com>	2025-10-23 13:07:43 -07:00
b8zhong	8ae9d4bb41	Revert "[ROCm] Remove vLLM rope dependency & use AITER impl" (#12028 )	2025-10-23 12:42:59 -07:00
Nicolas Castet	1c304aa9bc	Log iteration # for prefill and decode (#9366 )	2025-10-23 12:28:03 -07:00
Mick	770529a731	model: support deepseek-ocr (#11891 ) Co-authored-by: yhyang201 <47235274+yhyang201@users.noreply.github.com> Co-authored-by: yhyang201 <yhyang201@gmail.com> Co-authored-by: Shi Shuai <126407087+shuaills@users.noreply.github.com> Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>	2025-10-24 03:15:17 +08:00
ErvinXie	39c237f02c	Add AWQ quantization support for NPU. (#10158 ) Co-authored-by: Alisehen <814073252@qq.com> Co-authored-by: Yaochen Han <48639761+Alisehen@users.noreply.github.com> Co-authored-by: Zhengda Qin <zhengdqin@gmail.com>	2025-10-23 12:08:05 -07:00
Lianmin Zheng	ab07cd3e5a	[Auto Sync] Update test_deterministic_utils.py (20251023) (#12022 ) Co-authored-by: Stefan He <hebiaobuaa@gmail.com>	2025-10-23 11:20:45 -07:00
Netanel Haber	a98496834b	Feature/nano v2 offline modelopt fp8 and nvfp4 (#12018 ) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-10-23 11:16:46 -07:00
Teng Ma	96a5e4dd79	[Feature] Support loading weights from ckpt engine worker (#11755 ) Signed-off-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com> Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com> Co-authored-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com> Co-authored-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com> Co-authored-by: Shangming Cai <csmthu@gmail.com>	2025-10-23 09:23:30 -07:00
cctry	b0b4f71679	[Fix] memory leak by overlap + retract (#11981 ) Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>	2025-10-23 22:59:23 +08:00
Liangsheng Yin	6c18addb6f	Revert "Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4" (#12015 )	2025-10-23 21:27:58 +08:00
Liangsheng Yin	32852fe9e9	Move memory runtime checker to mixin class (#12014 )	2025-10-23 20:53:26 +08:00
Netanel Haber	d6fee73d1f	Support nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8/NVFP4 (#11866 )	2025-10-23 17:29:02 +08:00
Qiaolin Yu	36a4cad7b0	Support overlap-spec-v2 with trtllm_mla attention backend (#11821 )	2025-10-23 16:55:35 +08:00
yinghui	c23eda8589	Fix incorrect KV indices creation when page_size=32 in TRTLLM MLA backend (#11985 )	2025-10-22 22:44:45 -07:00
Jue WANG	138ff23187	Allow to disable batch decoding. (#11944 )	2025-10-22 21:57:12 -07:00
blzheng	13fb8b5489	[CPU] Optimize FP16 decode_attention_cpu (#10652 )	2025-10-22 21:39:51 -07:00
Zaili Wang	007b849b0e	[CPU] misc updates (#11906 )	2025-10-22 21:10:05 -07:00
Johnny	e7aa4664b3	[NVIDIA] Build CUDA 13 (#11299 ) Co-authored-by: ishandhanani <ishandhanani@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2025-10-22 20:03:12 -07:00
b8zhong	4d4feccbb2	[ROCm] Remove vLLM rope dependency & use AITER impl (#11322 )	2025-10-22 19:17:34 -07:00
jacky.cheng	99c92ff24b	[AMD] Support a new flag to disable quant on parallelLinear layer if required (#11811 )	2025-10-22 19:16:15 -07:00
Chang Su	6ade6a02d4	[grpc] Support gRPC standard health check (#11955 )	2025-10-22 16:59:09 -07:00
Christian Bahls	164302c7df	Implement BGE-M3 Sparse Embeddings in SGLang (#10869 ) Co-authored-by: Christian Bahls <christian.bahls@planet-ai.de> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-10-22 13:46:16 -07:00
jiahanc	eec9e471ca	[NVIDIA] Update to leverage flashinfer trtllm FP4 MOE throughput kernel (#11563 ) Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>	2025-10-22 13:11:16 -07:00
Lianmin Zheng	6d535b719f	Revert "Recapture cuda graph after model weight update to resolve IMA error " (#11980 )	2025-10-22 11:50:26 -07:00
yuho	fdcb1d13c5	[BUG] AttributeError: 'DeepEPMoE' object has no attribute 'use_w4a… (#11977 )	2025-10-22 11:29:55 -07:00
Hongbo Xu	d7e834d6ba	[6/n]decouple quantization implementation from vLLM dependency (#10750 )	2025-10-23 02:07:55 +08:00
Fan Yin	1d097aac87	[Fix] Remove unused import from triton_kernels_moe.py (#11967 ) Co-authored-by: Shangming Cai <171321666+shangmingcai@users.noreply.github.com>	2025-10-22 21:02:57 +08:00
996_icu	88568c01eb	[model] Support POINTSV15Chat (#9651 ) Co-authored-by: josephyou <josephyou@tencent.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: root <root@TENCENT64.site>	2025-10-22 16:58:17 +08:00
Hank Han	904655c5fd	[2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank (#10606 ) Co-authored-by: Xun Sun <UNIDY2002@outlook.com> Co-authored-by: Shangming Cai <csmthu@gmail.com>	2025-10-22 01:13:31 -07:00

4150 Commits