sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 12:17:09 +00:00

Author	SHA1	Message	Date
Baizhou Zhang	587deb15a7	[hotfix] Fix pytest not found in CI (#12311 )	2025-10-29 11:07:36 +08:00
Cheng Wan	83087247d1	[hotfix] missing `w13_weight_fp8` and `w2_weight_fp8` in UE8M0 requantization (#12259 )	2025-10-28 19:10:38 -07:00
Xiaoyu Zhang	334543ff3b	Add continuous_usage_stats support for streaming responses (#12241 )	2025-10-29 10:01:23 +08:00
b8zhong	c143f416ce	fix: Llama 4 BF16 load on Blackwell (#12308 )	2025-10-28 18:59:01 -07:00
fzyzcjy	29195aaa6e	Super tiny fix expert distribution dump error (#12271 )	2025-10-28 15:20:55 -07:00
bmac3	8d6ab1cb88	fix seqlen bug for trtllm_mla's draft_extend (#12295 )	2025-10-28 14:47:47 -07:00
b8zhong	77225d602a	Use Flashinfer TRT-LLM as Llama 4 compatible MoE backend (#11928 )	2025-10-28 10:39:43 -07:00
Trevor Morris	fdd00295b5	Fix 'BypassedTopKOutput' object has no attribute 'topk_weights' for DeepEP (#12231 )	2025-10-28 09:28:25 -07:00
Yineng Zhang	64cf868eba	chore: cleanup quant deps (#12268 )	2025-10-28 02:03:57 -07:00
Yineng Zhang	ea39952797	Revert "[Feature] PD-Multiplexing Context and Scheduler." (#12267 )	2025-10-28 02:00:37 -07:00
Shangming Cai	41a113356a	Fix potential eos bug on decode instance when PD is enabled (#12206 ) Signed-off-by: Shangming Cai <csmthu@gmail.com>	2025-10-28 01:29:02 -07:00
Xuchun Shang	a1f2dc90e4	[Bug fix] [PP] fix wrong dtype for quantified model (#12247 ) Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>	2025-10-28 01:27:24 -07:00
Feng Su	ea96106000	[Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 2 (#10804 ) Signed-off-by: Feng Su <sufeng@linux.alibaba.com>	2025-10-28 01:25:46 -07:00
Cheng Wan	b1e13e7cea	[hotfix] Incorrect CombineOverlapArgs in SBO (#12230 )	2025-10-28 01:23:06 -07:00
Chenxi Li	cc7b04a29c	Feature/Add GET endpoint to query loaded LoRA adapters (#12229 )	2025-10-28 01:22:00 -07:00
fzyzcjy	691c8534cf	Support releasing CUDA graph memory when paused (#7873 ) Co-authored-by: ryang-max <y1cunhui.yang@gmail.com> Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com>	2025-10-28 14:40:50 +08:00
Yongfei Xu	d2b8c4123e	Opt fused triton moe: add tma for down proj kernel (#10567 ) Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>	2025-10-28 14:26:17 +08:00
Scott Lee	bf8f7a944f	Add per-request retraction count (#11177 )	2025-10-27 23:22:34 -07:00
hlu1	81a632ace6	[DeepseekV32] Enable flashmla_prefill kernel with fp8 kvcache (#11655 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-10-27 23:11:48 -07:00
ishandhanani	285a8e6986	docker: add CUDA13 support in dockerfile and update GDRCopy/NVSHMEM for blackwell support (#11517 ) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2025-10-27 22:00:54 -07:00
Yuan Luo	813bd6f85c	[2/2] Use moe_sum_reduce cuda kernel (#10654 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>	2025-10-28 12:01:57 +08:00
Xinyuan Tong	729f612dc6	Update openai package version to 2.6.1 (#12222 )	2025-10-28 11:23:40 +08:00
jianan-gu	899453ac50	Use explicit uint64 dtype for Tensor data_ptr() to avoid overflow (#11994 )	2025-10-27 19:05:57 -07:00
Lifu Huang	ce832d7034	Add env var to control custom Triton kernel cache and set CSGMV as default backend. (#12176 )	2025-10-27 17:49:32 -07:00
weiliang	88596739a4	Support running FP4 Deepseek on SM120. (#11708 )	2025-10-27 17:37:49 -07:00
Yineng Zhang	a6ea3add76	[Auto Sync] Update scheduler.py, spec_info.py, run_suite.py... (20251027) (#12235 ) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: gongwei-130 <56567052+gongwei-130@users.noreply.github.com>	2025-10-27 17:21:08 -07:00
fzyzcjy	326c84c493	Compiling rope while preserving true on policy (#12161 )	2025-10-28 08:02:17 +08:00
gongwei-130	8da608cce0	fix: AttributeError: 'NixlKVManager' object has no attribute 'prefill_tp_size_table' (#12234 )	2025-10-27 15:57:33 -07:00
satyamk7054	9fc3e8aac7	Add support for Matryoshka embeddings (#126 ) (#11142 ) Co-authored-by: Satyam Kumar <satyamk@linkedin.com>	2025-10-28 02:49:36 +08:00
Chunyuan WU	c11b34d599	rope xpu: fix missing argument 'fused_set_kv_buffer_arg' and replace native with sgl_kernel_xpu impl (#12006 )	2025-10-28 01:02:18 +08:00
ykcombat	05ad28f25e	[Feature] PD-Multiplexing Context and Scheduler. (#11592 )	2025-10-28 00:54:43 +08:00
pansicheng	0cae873fcd	check_offload_progress more frequently (#11656 )	2025-10-28 00:37:38 +08:00
Haichao Zhu	a8b91f6b2d	improve mimax-m2 rmsnorm precision (#12186 )	2025-10-28 00:01:42 +08:00
Jimmy	959d1ab84b	fix(metrics): double times add_latency for DECODE_BOOTSTRAP (#12209 )	2025-10-27 23:59:48 +08:00
Muqi Li	6c1c193308	[Detokenizer Manager] Cleanup state when reqs are finished (#12205 )	2025-10-27 23:59:09 +08:00
cctry	3029d30189	Fix crash after flush cache (#12107 ) Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>	2025-10-27 23:52:27 +08:00
Yuan Luo	f389f01714	Optimize triton_mrope with torch compile (#12112 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-10-27 23:49:22 +08:00
Weiwei	caa4819bfc	Add support for AutoRound quantized models (#10153 )	2025-10-27 18:17:29 +08:00
Yuxuan Zhang	a88b006ecf	GLM-4-0414 and GLM-4.1V Code Refactor (#12117 )	2025-10-27 16:57:07 +08:00
sglang-bot	55d75e11bd	chore: bump SGLang version to 0.5.4.post1 (#12169 )	2025-10-27 09:35:20 +08:00
Chang Su	94aad0de99	[misc][grpc] Remove duplicate log (#12168 )	2025-10-26 14:06:59 -07:00
赵晨阳	7ebc28f5d6	[WIP] support MiniMax M2 model (#12129 ) Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Signed-off-by: xuebi <xuebi@minimaxi.com> Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Roger Young <42564206+rogeryoungh@users.noreply.github.com> Co-authored-by: xuebi <xuebi@minimaxi.com>	2025-10-26 13:58:54 -07:00
ash-sigh	0b3b3e9a69	transfer mrope_position_delta to device when first running (#11047 )	2025-10-26 13:06:09 -07:00
Elfie Guo	a1d5bc4cce	Avoid using flashinfer_allreduce_fusion when dp attention is enabled. (#11632 )	2025-10-26 12:31:14 -07:00
Zijian Zhang	a8023891f6	model: support NVILA and NVILA Lite (#10399 )	2025-10-26 09:58:09 -07:00
fzyzcjy	0103f374ba	Support DeepGEMM for deterministic inference (#12142 )	2025-10-26 22:36:17 +08:00
zyksir	96a5a949f6	[Fix] fix allreduce bug in Piecewise Graph (#12106 )	2025-10-26 21:15:48 +08:00
Liangsheng Yin	ea385ae85a	Fix ITL metrics when using openai endpoint with spec (#12156 )	2025-10-26 18:06:25 +08:00
Kai-Hsun Chen	6371f7af27	[quantization] AWQ Marlin doesn't work when dtype is bfloat16 (#11494 ) Signed-off-by: Kai-Hsun Chen <khchen@x.ai> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>	2025-10-26 15:49:45 +08:00
Liangsheng Yin	8491c794ad	[misc] depdencies & enviroment flag (#12113 )	2025-10-26 14:52:35 +08:00

4223 Commits