sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-03 22:07:12 +00:00

Author	SHA1	Message	Date
danielafrimi	3f1df322f9	[FIX] Always support TP > 4 for FP4 Gemm (#17300 )	2026-02-05 15:10:26 +08:00
Meng, Hengyu	368936a62b	[XPU] Integrate MoE and minor improvements in XPU attention backend (#13561 )	2026-02-04 23:09:59 -08:00
Xiaoyu Zhang	dff3ba202a	[Diffusion] Support layerwise offload for mova (#18272 )	2026-02-05 13:16:07 +08:00
Ch3ngY1	f730c18679	[PD] improve kv offset calculation for MHA model with different tp size (#18163 ) Co-authored-by: Shangming Cai <csmthu@gmail.com>	2026-02-05 10:43:23 +08:00
Mick	f218234e4f	[diffusion] chore: prohibit Chinese characters usage (#18249 )	2026-02-05 09:22:26 +08:00
yinghui	599c5f4922	fix kimi k2.5's moe gemm config init (#18064 )	2026-02-04 16:59:01 -08:00
linhaifeng	c1d5cc3b24	[Bugfix] fix a obvious logic error (#18254 )	2026-02-04 13:59:58 -08:00
Mohammad Miadh Angkad	efbf39583e	Add MoE fused config for Qwen3-Coder-Next-FP8 on H100 TP=2 (#18195 )	2026-02-04 13:36:35 -08:00
Zack Yu	2e87c2bd5e	fix: fix MockModelRunner in attention tests (#18240 )	2026-02-04 13:18:02 -08:00
Michael	6fd878b41d	[AMD] Add kimi mi35x nightly test, folder organization and several stability fixes (#17895 )	2026-02-04 12:03:57 -08:00
Mick	36a3e78af9	[diffusion] refactor: move model_stages into stages folder (#18248 )	2026-02-05 00:23:31 +08:00
RunningLeon	3e7ecb78a6	model: support interns1-pro (#18145 ) Co-authored-by: Ke Bao <ispobaoke@gmail.com>	2026-02-05 00:22:44 +08:00
RunningLeon	a6f53cc5e3	entrypoint: support passing spaces_between_special_tokens per request (#17939 )	2026-02-04 22:18:36 +08:00
wxy	4c403045ec	[diffusion] fix: fix the bug of redundant memory usage on GPU-0 (#18221 )	2026-02-04 21:25:23 +08:00
Zhang Yiyang (SII)	0c9a0adc53	[diffusion] chore: clean MOVA codes (#18107 )	2026-02-04 21:23:41 +08:00
BingjiaWang	760ae933bb	optimize get_topk_ragged by fusing get k and k_scale triton kernel (#16043 ) Co-authored-by: abing <wangbingjia.wbj@alibaba-inc.com>	2026-02-04 19:59:41 +08:00
Nicolas Castet	315306d8a9	Make sure we always disable symm memory without dp padding (#18129 )	2026-02-04 19:58:28 +08:00
Jincong Chen	a72f4f839c	Tiny fix for fp8 moe backend flashinfer_trtllm naming (#18243 )	2026-02-04 19:58:04 +08:00
Evrard-Nil	ce02df8592	[diffusion] logging: downgrade default prompt log from info to debug (#17813 ) Co-authored-by: Mick <mickjagger19@icloud.com>	2026-02-04 19:19:02 +08:00
Cheng Wan	84c09913eb	Moving _alloc_extend_naive out of npu allocator (#18200 )	2026-02-04 02:09:55 -08:00
zhangheng	be557cbc5f	[RadixTree][5/N Refactor]: Introduce pre and post-processing methods for key matching (#18147 )	2026-02-04 17:10:46 +08:00
Baizhou Zhang	d279520ba5	[DeepGemm] Add a flag for fast warmup (#18111 )	2026-02-04 14:12:13 +08:00
Jianying	4739f2e8d5	[diffusion] kernel: gated residual layernorm scale shift and layernorm scale shift kernel fusion for Qwen-Image, WAN and HunyuanVideo (#14717 ) Co-authored-by: AichenF <aichenf@nvidia.com> Co-authored-by: jianyingzhu <joeyzhu@nvidia.com> Co-authored-by: root <root@a4u8g-0120.ipp2a2.colossus.nvidia.com> Co-authored-by: Yihan Chen <yingluosanqian@example.com> Co-authored-by: 陈一涵 <yingluosanqian@gmail.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>	2026-02-04 13:46:20 +08:00
strgrb	37c33cc0aa	fuse qkvbfg linear into one gemm and f_b g_b into batched gemm. (#17801 )	2026-02-04 11:41:26 +08:00
Aurick Qiao	c1d529c196	Fix Session for multimodal and expose it through Engine (#18152 )	2026-02-04 10:33:27 +08:00
wxy	da758ed601	[diffusion] fix: fix server cache-dit bug under continuous dynamic requests (#17140 )	2026-02-04 09:03:37 +08:00
satyamk7054	793bf9fc06	Update weight rename check for Qwen3 Embeddings (#17535 )	2026-02-03 13:55:11 -08:00
Hudson Xing	e867040fc6	add streaming parallel tool call test case (#18097 )	2026-02-03 12:46:01 -08:00
R0CKSTAR	7de650c83c	[diffusion] hardware: support diffusion models on MTGPU (doc, 6/N) (#17346 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2026-02-03 12:44:57 -08:00
R0CKSTAR	ec2461bc16	[diffusion] hardware: support diffusion models on MTGPU (multi-GPU, 5/N) (#17318 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2026-02-03 12:44:22 -08:00
R0CKSTAR	acf724b036	[Diffusion] Only import sgl_kernel in custom op cuda path (SiluAndMul and RMSNorm) (#15592 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2026-02-03 12:42:58 -08:00
Vladislav Nosivskoy	e166ca8758	[HiCache] feat: Add detailed cache hit breakdown for HiCache in `sglext` and Prometheus metrics (#17648 ) Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>	2026-02-03 11:45:35 -08:00
Even Zhou	d48bbe3bed	[CI][NPU] Bugfix import sgl-kernel error (#18173 )	2026-02-03 11:39:38 -08:00
DiweiSun	495290aefd	enable ut test for xpu devices (#11712 ) Co-authored-by: jundu <jun.du@intel.com> Co-authored-by: Gao, Pengfei <pengfei.gao@intel.com>	2026-02-03 11:15:14 -08:00
elvischenv	99fab2ce67	[Bugfix] Fix Mistral Large 3 NVFP4 TRTLLM MoE (#18065 )	2026-02-03 20:32:49 +08:00
Lewis	a45647bce1	[PD] feat: support mooncake intra-node nvlink kv transfer (#17866 ) Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com> Co-authored-by: Teng Ma <teng-ma@linux.alibaba.com>	2026-02-03 17:47:52 +08:00
Xiaowei Wang	cc69ac9e7a	Warmup before profiling prefill latency for dynamic chunk sizing (#17198 ) Co-authored-by: Shangming Cai <csmthu@gmail.com>	2026-02-03 17:45:23 +08:00
Mohammad Miadh Angkad	6f6b9c6e42	[Perf] Use safetensors `load_file` in multithread loader (#18124 )	2026-02-02 23:21:13 -08:00
fatSheep	7a9d9c79d1	[HiCache] fix: apply extra_backend_tag in Mooncake batch_exists (#17265 )	2026-02-02 22:54:56 -08:00
Viacheslav	74f716dbd7	Gigachat 3 tool parser and tests (#14765 )	2026-02-02 22:28:34 -08:00
Kaixi Hou	4181290efd	[NVIDIA] Add --top-k argument to run_eval.py (#18025 ) Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 22:17:53 -08:00
b8zhong	78bf13db44	MoE Refactor: Refactor `modelopt_quant.py` -> `flashinfer_trllm.py` (#16685 ) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>	2026-02-02 20:45:14 -08:00
Xiaoyu Zhang	eedd472025	[Diffusion] fix serving image_edit get input image bug (#18109 )	2026-02-03 12:17:16 +08:00
Hank Han	e484c90cc7	Add triton_fused_moe config for GLM-4.7-FP8 tp8 H20 H20-3e (#18091 )	2026-02-03 12:08:23 +08:00
Linyu Wu	9b1619c148	[Move sgl-kernel Kernel to JIT] Add JIT concat MLA kernels (#17889 )	2026-02-03 10:49:17 +08:00
Mick	62004fd2be	[diffusion] UX: improve logging (#18122 )	2026-02-03 10:35:05 +08:00
zhangheng	180594358b	[HiCache]: Support DeepSeek v32 cpu offloading (#17415 ) Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>	2026-02-02 18:07:37 -08:00
Xiaoyu Zhang	a1bbc892af	[Diffsuion & JIT_kernel] QKNorm cross heads kernel (#18073 )	2026-02-03 10:03:17 +08:00
EkiRui	fd983b09b6	[Performance] Optimize radix cache eviction performance (#14339 ) Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com> Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com>	2026-02-03 09:44:20 +08:00
Alison Shao	28e2340725	Fix HF hub race condition in CI by coordinating model downloads across TP ranks (#17787 ) Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>	2026-02-02 14:57:45 -08:00

6437 Commits