sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-02 21:37:11 +00:00

Author	SHA1	Message	Date
Mick	1f930cd23d	[diffusion] CI: add testcase-wise retry mechanism (#14261 )	2025-12-02 11:06:12 +08:00
Kartik Ramesh	11ce05163d	Fix NIXL exception message (#14172 )	2025-12-02 10:39:45 +08:00
Stefan He	8fe8b63576	Revert "Try to remove wrong logic about max total token in spec decoding" (#14259 )	2025-12-01 18:18:03 -08:00
Yuan Luo	26aebf83d3	[VLM] Support Piecewise CUDA Graph for Qwen3-Omni-MOE (#14222 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-12-02 10:12:10 +08:00
Mick	3ab8ae6847	[diffusion] fix: fix Flux.2 condition image resize (#14232 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-02 10:05:44 +08:00
Lianmin Zheng	796d82b107	[Auto Sync] Add max_total_num_tokens metric: Update scheduler_metrics_mixin.py, collector.py (20251202) (#14256 ) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Dan Zheng <dzheng@x.ai>	2025-12-01 16:34:34 -08:00
Lianmin Zheng	1da59e8304	[Auto Sync] optionally disable fake register in Update fp8_kernel.py (20251202) (#14255 ) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: gauravjain14 <41287729+gauravjain14@users.noreply.github.com>	2025-12-01 16:11:12 -08:00
TomerBN-Nvidia	02af51e4fc	Support fp4 fp8 non gated moe (#13794 ) Co-authored-by: Roi Koren <roik@nvidia.com> Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>	2025-12-01 15:26:28 -08:00
Zhiyu	079b173853	Fix a distributed initialization error (#13843 ) Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>	2025-12-01 15:10:05 -08:00
YAMY	1f2b84d28d	Fix NSA Bug in Centralize NSA Dispatch Logic (#14245 )	2025-12-01 13:18:18 -08:00
ishandhanani	07821352fb	Revert "Skip weight loading in deepgemm compilation" (#14241 )	2025-12-01 12:59:09 -08:00
Byron Hsu	edbeaf3b88	[MM][style] rename inputs_embeds to input_embeds for consistency (#14240 )	2025-12-01 11:36:51 -08:00
liupeng374	2e8f54e61e	[spec-overlap] bugfix for pd disaggregation and npu (#14088 ) Co-authored-by: Even Zhou <even.y.zhou@outlook.com>	2025-12-01 22:58:20 +08:00
fzyzcjy	45264554f3	Super tiny fix typo (#14219 )	2025-12-01 20:19:17 +08:00
Liangsheng Yin	a2423052f6	Add cuda event based on waiting value (#14214 )	2025-12-01 18:51:44 +08:00
Lianmin Zheng	bc3d2a85af	[Minor] update docs (#14212 )	2025-12-01 02:33:58 -08:00
fzyzcjy	d815d00248	Tiny call cudaProfilerStart only on first rank in node (#14211 )	2025-12-01 18:18:45 +08:00
Xiaoyu Zhang	fa9021b21f	fix: Increase FlashInfer workspace size for Qwen3VL models (#14173 )	2025-12-01 17:54:23 +08:00
Xiaoyu Zhang	9c80072845	Add peak output tokens per second in bench_serving (#14165 ) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2025-12-01 17:47:54 +08:00
Yuan Luo	630a693081	[VLM] Boost Memory Pool based CUDA IPC (#14123 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-12-01 17:17:46 +08:00
Mick	7ce8faae28	[diffusion] refactor: remove hard-code of instanceof on PipelineConfig (#14186 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-01 16:35:34 +08:00
fzyzcjy	de153cf76a	Fix speculative decoding error when retracting (#14180 )	2025-12-01 15:30:13 +08:00
fzyzcjy	f4a0c5c76b	Try to remove wrong logic about max total token in spec decoding (#14167 )	2025-12-01 15:29:58 +08:00
Binyao Jiang	0f8e53947d	[Piecewise] Use same global graph memory pool as the main cuda graph … (#14044 ) Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: BBuf <1182563586@qq.com>	2025-11-30 23:04:10 -08:00
fzyzcjy	e8ba5a668c	Support profiling only prefill or decode without the other (#14182 )	2025-12-01 14:46:30 +08:00
fzyzcjy	a2960bdd6b	Super tiny allow millisecond precision in logging (#14183 )	2025-12-01 14:46:09 +08:00
fzyzcjy	487c8d4df3	Tiny add several args to bench serving (#14181 )	2025-12-01 14:45:47 +08:00
fzyzcjy	f87b8eab23	Tiny fix transform_scale_ue8m0 wrong output in some scenarios (#14003 )	2025-12-01 14:45:27 +08:00
Minglei Zhu	e8542db558	[piecewise] move piecewise_cuda_graph_runner init to model_runner initialize (#14034 ) Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>	2025-11-30 22:16:04 -08:00
Lianmin Zheng	6df1e8d628	[Auto Sync] Update backend.py (20251130) (#14153 ) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>	2025-11-30 22:15:02 -08:00
qichu-yun	bd0e690857	[Feature] Enable PTPC FP8 for compressed tensors moe (aiter kernel) (#12181 )	2025-11-30 21:54:28 -08:00
Byron Hsu	0825d7f4c6	[piecewise] Refactor VLM to support input embed buffer and remove external embedder hack (#14155 )	2025-11-30 21:43:09 -08:00
Yuhao Yang	0b9dbea593	[diffusion] chore: improve z-image (#14104 )	2025-12-01 12:26:17 +08:00
Uranus	982db4ebac	Feat: GLM-4.6 supports shared experts fusion (#13873 ) Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com> Co-authored-by: Kevin-XiongC <kevin_xiong1997@outlook.com> Co-authored-by: Mingyi Jin <jinmingyi1998@sina.cn>	2025-12-01 11:33:18 +08:00
Teng Ma	f5f3a5d98c	[PD] Support json file configuration for Transfer Engine (#14059 ) Co-authored-by: Shangming Cai <csmthu@gmail.com>	2025-12-01 10:47:33 +08:00
YAMY	decb48965d	[DeepSeekV3.2] Enable pure TP & Partial DP Attention (#13646 )	2025-11-30 15:59:23 -08:00
Fan Yin	c72f0756d2	Fix: fix flashmla fp8 kv cache acc error (#13841 ) Co-authored-by: ybyang <ybyang7@iflytek.com>	2025-11-30 13:38:19 -08:00
Baizhou Zhang	f1115cf58d	Revert "[Minor]Raise Error when deepep num dispatch token per rank is smaller than cuda graph bs" (#14171 )	2025-11-30 12:49:46 -08:00
Baizhou Zhang	7b03cc6482	[Minor]Raise Error when deepep num dispatch token per rank is smaller than cuda graph bs (#14065 )	2025-11-30 10:11:42 -08:00
Dongjoo Seo	c15c864b6f	Fix LMCache unit test and init bug (#14005 ) Signed-off-by: DongDongJu <commisori28@gmail.com>	2025-11-30 23:57:32 +08:00
PiteXChen	dc7bdc7329	bugfix[schedule]: Excessive preemption occurs when preempting running requests to schedule new prefill requests. (#12494 ) Signed-off-by: CLFutureX <chenyongqyl@163.com>	2025-11-30 22:29:26 +08:00
Liangsheng Yin	0a9d64530d	Support grammar + spec + reasoning (#14163 )	2025-11-30 21:19:57 +08:00
fzyzcjy	340c613ab5	Support numactl bind for CPU and memory before process starts (#14156 )	2025-11-30 17:00:33 +08:00
fzyzcjy	36b729c2b8	Implement profiler v2 and fix stage mixture bug (#14148 )	2025-11-30 16:59:52 +08:00
Tianhao Zhou	67e6ef4b2d	feat: longcat flash add aux layers capture for eagle3 (#14161 )	2025-11-30 00:50:55 -08:00
strgrb	65ba5ab8b1	add cpp files for cpp_radix_tree to pyproject.toml. (#14052 )	2025-11-30 13:05:04 +08:00
WenhaoZhang	990023e59b	[diffusion] lora: Fix LoRA weight merging for torch.nn.Linear layers from diffusers modules (#14150 ) Co-authored-by: niehen6174 <niehen.6174@gmail.com>	2025-11-30 12:44:12 +08:00
fzyzcjy	0ae4b1ad81	Show errors when misusing env variables (#14154 )	2025-11-30 10:57:35 +08:00
fzyzcjy	94cd64a7b0	Support checking fp8 params in weight_checker (#14147 )	2025-11-30 09:08:59 +08:00
fzyzcjy	b870271a50	Fix spec v2 does not support RL update weights from tensor (#14146 )	2025-11-30 09:08:05 +08:00

4880 Commits