sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-02 04:37:14 +00:00

Author	SHA1	Message	Date
Eva20150932-atlascloud	7c38eca1e4	feat: DeepSeek new v3.2 encoding (#14249) Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>	2025-12-02 11:41:05 -08:00
Quanfeng Li	427b08e24d	Init TBO with dp_padded batch (#11423) Co-authored-by: Cheng Wan <wan4ch@gmail.com> Co-authored-by: Yuhao Yao <37280700+yuhyao@users.noreply.github.com>	2025-12-02 10:34:26 -08:00
alisonshao	0141ca370f	Revert PR #14044: Restore separate memory pool for piecewise CUDA graph (#14278)	2025-12-02 09:53:16 -08:00
alisonshao	25a6be4930	Fix duplicate download log messages in multi-process environment (#14299)	2025-12-02 09:33:18 -08:00
Mick	9530b76630	[diffusion] refactor: simplify DmdDenoisingStage (#14269)	2025-12-02 18:59:40 +08:00
Jinyan Chen	3067b3f050	[diffusion] chore: improve model info registration and searching strategy (#14281) Co-authored-by: Jinyan Chen <jinyanc@nvidia.com> Co-authored-by: Mick <mickjagger19@icloud.com>	2025-12-02 18:28:59 +08:00
Lianmin Zheng	64092c8b55	[Auto Sync] Rename is_hybrid to is_hybrid_swa (#14252) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com> Co-authored-by: Hanming Lu <hanming@x.ai>	2025-12-01 23:24:24 -08:00
sglang-bot	63b9300f00	chore: bump sgl-kernel version to 0.3.18.post2 (#14244)	2025-12-01 23:14:12 -08:00
b8zhong	236a7c2370	fix trtllm mla spec (#13738) Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com>	2025-12-01 22:16:25 -08:00
Roger Young	3dabd609fb	Optimize topk sigmoid in minimax_m2 (#14047) Co-authored-by: xuebi <xuebi@minimaxi.com>	2025-12-02 14:07:12 +08:00
Kevin Li	c9e2090101	fix: Support PP for Mistral Small 3.1 (#14254)	2025-12-02 13:04:14 +08:00
kun-llfl	106df4eac5	Fix mrope_positions size when req is retracted (#13700) Signed-off-by: Kun(llfl) <i@imux.top> Co-authored-by: Xuchun Shang <xuchun.shang@gmail.com>	2025-12-02 11:38:20 +08:00
Mick	1f930cd23d	[diffusion] CI: add testcase-wise retry mechanism (#14261)	2025-12-02 11:06:12 +08:00
Kartik Ramesh	11ce05163d	Fix NIXL exception message (#14172)	2025-12-02 10:39:45 +08:00
Stefan He	8fe8b63576	Revert "Try to remove wrong logic about max total token in spec decoding" (#14259)	2025-12-01 18:18:03 -08:00
Yuan Luo	26aebf83d3	[VLM] Support Piecewise CUDA Graph for Qwen3-Omni-MOE (#14222) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-12-02 10:12:10 +08:00
Mick	3ab8ae6847	[diffusion] fix: fix Flux.2 condition image resize (#14232) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-02 10:05:44 +08:00
Lianmin Zheng	796d82b107	[Auto Sync] Add max_total_num_tokens metric: Update scheduler_metrics_mixin.py, collector.py (20251202) (#14256) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Dan Zheng <dzheng@x.ai>	2025-12-01 16:34:34 -08:00
Lianmin Zheng	1da59e8304	[Auto Sync] optionally disable fake register in Update fp8_kernel.py (20251202) (#14255) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: gauravjain14 <41287729+gauravjain14@users.noreply.github.com>	2025-12-01 16:11:12 -08:00
TomerBN-Nvidia	02af51e4fc	Support fp4 fp8 non gated moe (#13794) Co-authored-by: Roi Koren <roik@nvidia.com> Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>	2025-12-01 15:26:28 -08:00
Zhiyu	079b173853	Fix a distributed initialization error (#13843) Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>	2025-12-01 15:10:05 -08:00
YAMY	1f2b84d28d	Fix NSA Bug in Centralize NSA Dispatch Logic (#14245)	2025-12-01 13:18:18 -08:00
ishandhanani	07821352fb	Revert "Skip weight loading in deepgemm compilation" (#14241)	2025-12-01 12:59:09 -08:00
Byron Hsu	edbeaf3b88	[MM][style] rename inputs_embeds to input_embeds for consistency (#14240)	2025-12-01 11:36:51 -08:00
liupeng374	2e8f54e61e	[spec-overlap] bugfix for pd disaggregation and npu (#14088) Co-authored-by: Even Zhou <even.y.zhou@outlook.com>	2025-12-01 22:58:20 +08:00
fzyzcjy	45264554f3	Super tiny fix typo (#14219)	2025-12-01 20:19:17 +08:00
Liangsheng Yin	a2423052f6	Add cuda event based on waiting value (#14214)	2025-12-01 18:51:44 +08:00
Lianmin Zheng	bc3d2a85af	[Minor] update docs (#14212)	2025-12-01 02:33:58 -08:00
fzyzcjy	d815d00248	Tiny call cudaProfilerStart only on first rank in node (#14211)	2025-12-01 18:18:45 +08:00
Xiaoyu Zhang	fa9021b21f	fix: Increase FlashInfer workspace size for Qwen3VL models (#14173)	2025-12-01 17:54:23 +08:00
Xiaoyu Zhang	9c80072845	Add peak output tokens per second in bench_serving (#14165) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2025-12-01 17:47:54 +08:00
Yuan Luo	630a693081	[VLM] Boost Memory Pool based CUDA IPC (#14123) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-12-01 17:17:46 +08:00
Mick	7ce8faae28	[diffusion] refactor: remove hard-code of instanceof on PipelineConfig (#14186) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-12-01 16:35:34 +08:00
fzyzcjy	de153cf76a	Fix speculative decoding error when retracting (#14180)	2025-12-01 15:30:13 +08:00
fzyzcjy	f4a0c5c76b	Try to remove wrong logic about max total token in spec decoding (#14167)	2025-12-01 15:29:58 +08:00
Binyao Jiang	0f8e53947d	[Piecewise] Use same global graph memory pool as the main cuda graph … (#14044) Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: BBuf <1182563586@qq.com>	2025-11-30 23:04:10 -08:00
fzyzcjy	e8ba5a668c	Support profiling only prefill or decode without the other (#14182)	2025-12-01 14:46:30 +08:00
fzyzcjy	a2960bdd6b	Super tiny allow millisecond precision in logging (#14183)	2025-12-01 14:46:09 +08:00
fzyzcjy	487c8d4df3	Tiny add several args to bench serving (#14181)	2025-12-01 14:45:47 +08:00
fzyzcjy	f87b8eab23	Tiny fix transform_scale_ue8m0 wrong output in some scenarios (#14003)	2025-12-01 14:45:27 +08:00
Minglei Zhu	e8542db558	[piecewise] move piecewise_cuda_graph_runner init to model_runner initialize (#14034) Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>	2025-11-30 22:16:04 -08:00
Lianmin Zheng	6df1e8d628	[Auto Sync] Update backend.py (20251130) (#14153) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Stefan He <hebiaobuaa@gmail.com> Co-authored-by: Byron Hsu <byronhsu1230@gmail.com>	2025-11-30 22:15:02 -08:00
qichu-yun	bd0e690857	[Feature] Enable PTPC FP8 for compressed tensors moe (aiter kernel) (#12181)	2025-11-30 21:54:28 -08:00
Byron Hsu	0825d7f4c6	[piecewise] Refactor VLM to support input embed buffer and remove external embedder hack (#14155)	2025-11-30 21:43:09 -08:00
Yuhao Yang	0b9dbea593	[diffusion] chore: improve z-image (#14104)	2025-12-01 12:26:17 +08:00
Uranus	982db4ebac	Feat: GLM-4.6 supports shared experts fusion (#13873) Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com> Co-authored-by: Kevin-XiongC <kevin_xiong1997@outlook.com> Co-authored-by: Mingyi Jin <jinmingyi1998@sina.cn>	2025-12-01 11:33:18 +08:00
Teng Ma	f5f3a5d98c	[PD] Support json file configuration for Transfer Engine (#14059) Co-authored-by: Shangming Cai <csmthu@gmail.com>	2025-12-01 10:47:33 +08:00
YAMY	decb48965d	[DeepSeekV3.2] Enable pure TP & Partial DP Attention (#13646)	2025-11-30 15:59:23 -08:00
Fan Yin	c72f0756d2	Fix: fix flashmla fp8 kv cache acc error (#13841) Co-authored-by: ybyang <ybyang7@iflytek.com>	2025-11-30 13:38:19 -08:00
Baizhou Zhang	f1115cf58d	Revert "[Minor]Raise Error when deepep num dispatch token per rank is smaller than cuda graph bs" (#14171)	2025-11-30 12:49:46 -08:00

4892 Commits