sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 20:27:57 +00:00

Author	SHA1	Message	Date
strgrb	738ebfd330	KDA: fuse qkv conv and support stride for fused_sigmoid_gating_delta_rule_update_kernel (#19506 )	2026-03-04 22:45:53 +08:00
YeChang Guo	6910c1b281	[Feature][NPU]: add runtime support for GPTQ-quantized MoE models (#16364 ) Co-authored-by: GuoYechang <52730608+GuoYechang@users.noreply.github.com> Co-authored-by: root <root@localhost.localdomain>	2026-03-04 16:02:19 +03:00
Shangming Cai	c2b66d320d	[HiCache] Add an env var to control transfer engine reuse (#19867 )	2026-03-04 20:36:32 +08:00
chenxu214	88cfa6c11d	[NPU]Releasing redundant memory of w13_weight and nz when the ascend_fuseep feature is enabled (#19813 )	2026-03-04 19:26:29 +08:00
sky	17119a697d	Optimization: Reduce the number of D2H operations (#19424 ) Signed-off-by: wangfakang <fakangwang@gmail.com>	2026-03-04 16:32:42 +08:00
Mohammad Miadh Angkad	09fa012ba7	Fix /health regression from early prebound socket listen (#19805 )	2026-03-03 23:00:46 -08:00
Yuhao Yang	115f879958	Helios: Real Real-Time Long Video Generation Model (#19782 )	2026-03-04 14:58:04 +08:00
qwe	562c3ff2d0	[Feature] implement the standard multi-layer MTP for step3p5 (#18564 ) Co-authored-by: mei ran <meiran0528@gmail.com> Co-authored-by: yhyang201 <yhyang201@gmail.com>	2026-03-03 22:48:53 -08:00
DefTruth	e9b5706545	[diffusion] feat: support torch compile for diffusers backend (#19673 )	2026-03-04 14:08:45 +08:00
Michael	c6850ac30c	[AMD] Fix Qwen3-Coder-Next: Add missing k_scale/v_scale args to extend_attention_fwd in aiter_backend (#19736 )	2026-03-03 22:01:08 -08:00
Jue Wang	5972f97f11	Remove naive rotary forward overriding. (#19263 )	2026-03-03 21:50:40 -08:00
ybyang	ac1f07487a	Fix triton alloc extend kernel (#19780 )	2026-03-03 21:01:16 -08:00
Bi Xue	73bf2c5bdc	[sgl]add pin_mem to remove cpu->gpu copy sync point (#19795 ) Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com>	2026-03-03 21:00:51 -08:00
sglang-bot	b7f7df7ee6	[NSA] Fix line-too-long lint in `can_nsa_prefill_cp_round_robin_split` (#19829 )	2026-03-03 20:34:22 -08:00
Yuhao Yang	ca44aa25af	Fix dp_attention crash when dp_size < tp_size in warmup dummy run (#19760 )	2026-03-03 19:43:13 -08:00
Ruihang Li	da9dcbc906	[diffusion] fix: fix corrupted image editing outputs in Multi-GPU SP mode for FLUX.2-klein models (#19454 )	2026-03-04 11:35:46 +08:00
Baidu-AIAK	6851613b93	[Bugfix] For cp: Fixed hang problem in prefix cache and kvcache support fp8 in-seq-split mode (#19656 ) Co-authored-by: vincent <vincent@vincentdeMacBook-Pro.local>	2026-03-03 19:19:46 -08:00
Xiaoyu Zhang	4348976f80	[Diffusion] Refactor diffusion benchmark/profile skill to reuse diffusion-perf skill and clarify profiling trigger (#19783 )	2026-03-04 10:54:42 +08:00
Yuan Luo	82e7139c06	[VLM] Support cos sin cache for Ernie4.5-VL (#19743 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2026-03-04 10:54:23 +08:00
Xiaoyu Zhang	115e9a1acd	[Diffusion] Delete useless _ulysses_input_split func (#19786 )	2026-03-04 10:45:11 +08:00
xieminghe1	ee5ccde0ad	support fused_moe_triton and moe_sum_all_reduce kernel fusion[reduce … (#19672 ) Co-authored-by: undefined <zhouchen.arrebol@jd.com>	2026-03-04 10:30:33 +08:00
Charles Chen	d22c6a3847	fix: Properly return abort error for streaming requests if the abort is triggered by scheduler (#19357 )	2026-03-03 17:18:15 -08:00
Brayden Zhong	e2af840c3d	Various SM120 improvements (#19721 )	2026-03-03 16:46:13 -08:00
Hao Jin	a69b943356	[SGLang-Diffusion] Add offline throughput benchmark script for multi-modal models (#18154 ) Co-authored-by: Hao Jin <Hao Jin> Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>	2026-03-04 08:39:46 +08:00
Hubert Lu	441045a7bf	[AMD] Fix EAGLE3 speculative decoding with aiter attention backend (#19362 )	2026-03-03 16:12:13 -08:00
Jiayi Yan	753da27535	[Bugfix] fix parse_lscpu_topology bug (#18520 )	2026-03-03 15:15:36 -08:00
Cao E	069c7e4188	Fix CI failures (#19303 )	2026-03-03 15:04:44 -08:00
Yi Zhong	b8c71f895e	Add tuned triton==3.5.1 h200 tp2, tp4 for qwen 3 next (#15948 ) Signed-off-by: vincentzed	2026-03-03 14:47:19 -08:00
Yi Zhong	0c760c4cd7	Add tuned triton==3.5.1 b200 tp2, tp4 for qwen 3 next (#15917 ) Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>	2026-03-03 14:47:05 -08:00
Jonah Bernard	fb37c0a400	[args] Add Expert Parallelism Argument To SRT Runner (#18492 ) Co-authored-by: Qiaolin Yu <liin1211@outlook.com>	2026-03-03 14:16:35 -08:00
Praneth Paruchuri	f7897def96	[Feature] Improve weight loading log (#18651 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-03 14:16:13 -08:00
Brayden Zhong	9305f0e58d	Support `triton_kernels` for GPT-OSS on SM120 (#19718 ) Co-authored-by: amittell 1388680+amittell@users.noreply.github.com	2026-03-03 14:14:01 -08:00
Sam (Kesen Li)	5b2e2750b5	Enable XQA for SM90 and SM120 (#17115 ) Co-authored-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>	2026-03-03 14:09:44 -08:00
Kangyan-Zhou	dc92f88a21	Enhance bench_multiturn.py with OpenAI API support and richer metrics (#19724 ) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:48:04 -08:00
Guy Stone	f749802402	[Score API][18132] return token usage in Score API response (#18381 )	2026-03-03 13:45:35 -08:00
almaslof	b0f26698f5	feat(benchmark script): add similar to vllm --ready-check-timeout-sec parameter (#15466 )	2026-03-03 13:44:38 -08:00
Karthik Koralla	85ab6a7f54	cli: Add lazy imports and fail-fast config validation (RFC #9853 ) (#19368 ) Co-authored-by: hnyls2002 <lsyincs@gmail.com>	2026-03-03 13:03:49 -08:00
xrwang8	cedb86a950	Feature:Reserve HTTP server port before model loading to immediately detect port conflicts instead of failing after several minutes of model loading. (#17754 ) Signed-off-by: xrwang8 <xrwang8@gmail.com>	2026-03-03 11:59:18 -08:00
doujiang24	2e1b9e2547	Fix routed_dp_rank boundary validation (#19762 ) Signed-off-by: doujiang24 <doujiang24@gmail.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>	2026-03-03 11:55:15 -08:00
yefei12	85f7a0aa30	feat: support Kimi K2.5 for Eagle3 (#19689 ) Co-authored-by: chenyefei.cyf <chenyefei.cyf@U-9V5T77LW-2356.local> Co-authored-by: GeLee-Q <865038696@qq.com> Co-authored-by: Gao016 <yngao016@163.com> Co-authored-by: sxl1993 <1218197792@qq.com>	2026-03-03 13:41:15 -05:00
xutizhou	c6377bbbca	feat(gdn): add FlashInfer K-last SSM layout support for GDN prefill and decode for Hopper (#18361 ) Co-authored-by: HongliMi <106042350+HongliMi@users.noreply.github.com> Co-authored-by: xiaozhoupy <181108106+zhou9402@users.noreply.github.com> Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com> Co-authored-by: Avery Yingyi Huang <averyh@nvidia.com> Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com>	2026-03-03 20:30:48 +08:00
Jasonzhang517	d939e26585	[model gateway][0/N] router EPD support: add encoder grpc server backend support (#16552 ) Co-authored-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com> Co-authored-by: Zongyao Chen <solar1s@163.com>	2026-03-03 19:38:15 +08:00
Shangming Cai	facde4c6d3	[PD] Enable all CP ranks for KVCache transfer (#19765 ) Signed-off-by: Shangming Cai <csmthu@gmail.com>	2026-03-03 19:35:21 +08:00
shengzhaotian	365ca1edb5	[NPU] bugs fix: fix a condition bug when using speculative inference on Qwen3 and Qwen3 moe (#19532 )	2026-03-03 17:59:25 +08:00
Muqi Li	666caaf9ce	[Tool Call] Stream DeepSeek-V3.2 function call parameters in JSON format. (#16091 ) Co-authored-by: Huixxi <uestc.hugo@gmail.com>	2026-03-03 01:46:29 -08:00
Shaun Kotek	4c95953b77	Fix/nemotron mtp quantaized (#19433 )	2026-03-03 01:07:46 -08:00
Charles Chen	af0d35b224	Fix: Reject requests with a duplicate request ID which can cause server crash/hang (#19035 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com>	2026-03-03 00:33:25 -08:00
Muqi Li	6af0448cc9	[Bugfix] Catch errors when DeepSeek-V3.2 generates malformed JSON (#18174 )	2026-03-03 00:10:07 -08:00
Liangsheng Yin	7a2d3df96f	Apply default stream to priority 0 in scheduling. (#16438 )	2026-03-03 00:05:27 -08:00
Zack Yu	07b8d763ef	feat: Add FP8 KV cache support for Triton attention backend (#18882 )	2026-03-02 23:38:34 -08:00

7855 Commits