sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 04:08:10 +00:00

Author	SHA1	Message	Date
shuwenn	b65799cf83	[SPEC][1/N] feat: add adaptive speculative_num_steps for EAGLE topk=1 (#21599 ) Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>	2026-04-20 14:25:04 -07:00
Lianmin Zheng	44e67c6835	Remove deprecated double sparsity feature (#23009 )	2026-04-17 13:33:12 -07:00
Jan Bernlöhr	04a53955b9	feat: add coordinated checkpoint prefetch for network filesystem loading (#20843 )	2026-04-16 20:08:19 -07:00
Liangsheng Yin	db7a751d48	refactor: extract FanOutCommunicator and use declarative spec table (#22967 )	2026-04-16 15:37:19 -07:00
hhwxw	2480cc2a16	docs: fix incorrect default max-payload-size in gateway config reference (#22923 )	2026-04-16 13:25:27 +08:00
cctry	f855a0bde6	Introduce CUDA graph debug mode with breakable CUDA graph (#19102 ) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Cheng Wan <chwan@rice.edu> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-11 00:36:56 -07:00
Zhangheng	5ba7d4e523	[HiSparse]: Update HiSparse's user-guide (#22499 ) Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>	2026-04-10 15:06:43 +08:00
Zhangheng	3d3a32c0b9	[HiSparse]: Add readme docs for HiSparse Feature (#22238 )	2026-04-07 00:39:24 -07:00
Khoa Pham	12272b6791	[Spec][Ngram] 6/N: Load an external corpus and construct a Suffix Automaton (#21425 )	2026-04-06 00:11:14 -07:00
YAMY	dc125afffb	Add staging buffer CI test and documentation for heterogeneous TP (#21921 ) Co-authored-by: Shangming Cai <csmthu@gmail.com>	2026-04-06 14:00:20 +08:00
narutolhy	24763256b9	[Speculative Decoding] Add FA4-based Spec Support (#21080 ) Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>	2026-04-04 02:09:45 -07:00
Liangsheng Yin	f25bf86065	Fix ngram doc for speculative_num_draft_tokens default (#21910 )	2026-04-01 22:18:24 -07:00
Khoa Pham	f836658077	[Spec][Ngram] 4/N: Remove `max_match_window_size` and `min_match_window_size`, matching all suffixes of the Trie (#21225 )	2026-04-01 22:09:46 -07:00
David Cheung	ed427e1299	Migrate all callers from /get_server_info to /server_info (#21463 )	2026-04-01 21:17:50 -07:00
Noa Neria	8d9145d97e	Direct model loading from object storage with Runai Model Streamer (#17948 ) Signed-off-by: Noa Neria <noa@run.ai>	2026-04-01 18:41:22 -07:00
Brayden Zhong	6a9b09847c	CUTLASS NVFP4 GEMM improvement of SM120 (#21314 )	2026-04-01 09:04:34 +08:00
Aishwarya Ramasethu	c32ee48886	MFU metrics in Prometheus (#19395 )	2026-03-29 23:40:06 -07:00
Артем Савкин	27071e0a43	[NPU] Update quantization&CI documentation (#21100 ) Co-authored-by: Tamir Baydasov <41994229+TamirBaydasov@users.noreply.github.com>	2026-03-28 21:42:21 +03:00
Baizhou Zhang	edd4d54023	[Clean] Remove deprecated environs (#21536 )	2026-03-28 00:35:44 -07:00
Jiaxin(Jackson) Deng	c4db64c16b	Add Lychee Doc Links Check to Local and CI (#19742 ) Co-authored-by: Zijie Xia <zijie_xia@icloud.com> Co-authored-by: Zijie Xia <zijiexia@users.noreply.github.com> Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>	2026-03-24 13:48:26 -07:00
kpham-sgl	bc4aaab6a1	[Spec][Ngram] 2/N: Rename branch length to max trie depth (#21181 ) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 23:35:25 -07:00
kpham-sgl	6d160b42bb	[Spec][Ngram] 1/N: Reference based Speculative Decoding refactor (#20393 )	2026-03-22 00:55:10 -07:00
Xinyuan Tong	d1e95af282	Upgrade transformers==5.3.0 (#17784 ) Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com> Co-authored-by: Alison Shao <alisonshao@mac.lan> Co-authored-by: Mick <mickjagger19@icloud.com>	2026-03-18 13:50:43 -07:00
ishandhanani	8f0f36c64b	[1/2] Add ModelExpress coordination for remote instance weight loading - matching TP (#19920 ) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>	2026-03-18 13:38:32 -07:00
Kangyan-Zhou	3d8fc9a0ca	Revert "[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api" (#20792 )	2026-03-17 11:59:02 -07:00
Shu Wang	d35fea1b2b	[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api (#12787 )	2026-03-17 10:02:45 -07:00
Teng Ma	7c498a6538	[DOC] add documents for encoder global mm cache (#20636 )	2026-03-15 16:44:21 -07:00
Mook	23c191afb6	fix(docs): correct quantization documentation (#20301 ) (#20619 )	2026-03-15 12:33:12 -04:00
Liangsheng Yin	fc7f9c1de7	Rename --stream-output to --incremental-streaming-output (#20614 )	2026-03-14 23:22:33 -07:00
Matt Van Horn	d093e70067	[Doc] Add DSA/NSA attention backend to support matrix (#20326 ) Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 13:40:35 -04:00
Yoray Zack	9991debde3	[Feature] Integrate Elastic NIXL-EP into SGLang (#19248 ) Signed-off-by: Barak Biber <bbiber@nvidia.com> Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Barak Biber <bbiber@nvidia.com>	2026-03-11 17:37:43 +08:00
Liangsheng Yin	50953aea8d	[Scheduler] Unify idle checks into `is_fully_idle()` and fix weight update test (#20296 )	2026-03-10 17:50:23 -07:00
Ziang Li	76ee4bb98c	[FlashInfer v0.6.4] [RL] Integrate FlashInfer mxfp8 gemm, MoE, and routed MoE (#19537 )	2026-03-10 15:37:57 -07:00
shuwenn	5a11ae19c1	[CI] fix: notebook ci often OOM (#20199 )	2026-03-09 22:32:41 -07:00
Brayden Zhong	591e61245a	[Doc] Add smal table for GEMM backends (#20213 )	2026-03-09 22:19:57 -07:00
YEJIN KIM	0fd9a57d80	[Doc] Verify and Modify some attention backend specs (#20210 )	2026-03-09 23:05:52 +00:00
shuwenn	7bd3dd9270	fix: image URL in notebook to use raw.githubusercontent.com (#20100 )	2026-03-07 13:28:20 -08:00
Bruce Changlong Xu	feda2b11c4	[AMD] Add AWQ AMD CI coverage and quantization platform compatibility docs (#19550 )	2026-03-04 19:50:55 -08:00
Brayden Zhong	e2af840c3d	Various SM120 improvements (#19721 )	2026-03-03 16:46:13 -08:00
Sam (Kesen Li)	5b2e2750b5	Enable XQA for SM90 and SM120 (#17115 ) Co-authored-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>	2026-03-03 14:09:44 -08:00
zwang86	d6ac5f23cc	[Docs] Add GDN attention backends matrix documentation (#19755 ) Co-authored-by: Zeyu Wang <zeyu.wang@yahooinc.com>	2026-03-03 13:00:34 -08:00
Jasonzhang517	d939e26585	[model gateway][0/N] router EPD support: add encoder grpc server backend support (#16552 ) Co-authored-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com> Co-authored-by: Zongyao Chen <solar1s@163.com>	2026-03-03 19:38:15 +08:00
Yuwei An	0abb9f4176	Piecewise Cuda Graph Docs (#19738 ) Signed-off-by: yuweia <ayw.sirius19@gmail.com> Co-authored-by: Wenyao Gao <wgao11@u.rochester.edu>	2026-03-03 11:51:17 +08:00
shuwenn	bdffb027a8	[CI] fix: handle missing repo in lora notebook (#19700 )	2026-03-02 10:27:32 -08:00
Shangming Cai	0a6678bf3a	[PD] Remove unused server args for disaggregation (#19618 ) Signed-off-by: Shangming Cai <csmthu@gmail.com>	2026-03-02 11:38:50 +08:00
shuwenn	e3e71f275a	docs: refactor speculative decoding doc (#19186 )	2026-03-01 22:03:20 -05:00
zwang86	f51ddba131	feat: add FA4 SM90 paged KV decode support & update attention docs (#18442 ) Co-authored-by: Zeyu Wang <zeyu.wang@yahooinc.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>	2026-03-02 09:12:19 +08:00
ympcMark	43fade5f69	[4/N] (Elastic EP) Back up Expert Weights in DRAM (#17374 ) Co-authored-by: UNIDY2002 <unidy2002@outlook.com>	2026-02-27 15:59:13 +08:00
billishyahao	60eeef7370	[AMD][with CI Fix] support two batch overlapping for mori ep (#19216 ) Co-authored-by: Duyi-Wang <duyi.wang@amd.com> Co-authored-by: kkHuang-amd <wunhuang@amd.com> Co-authored-by: Feiyue Zhai <feiyue.zhai@amd.com> Co-authored-by: HAI <hixiao@gmail.com>	2026-02-25 02:14:08 -08:00
huangtingwei	d40cb2f725	[HiCache] Support heterogeneous tp for hicache storage (#18541 ) Co-authored-by: hzh0425 <hzh0425@apache.org>	2026-02-25 00:13:57 -08:00

240 Commits