sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 04:08:10 +00:00

Author	SHA1	Message	Date
Lianmin Zheng	44e67c6835	Remove deprecated double sparsity feature (#23009)	2026-04-17 13:33:12 -07:00
Jan Bernlöhr	04a53955b9	feat: add coordinated checkpoint prefetch for network filesystem loading (#20843)	2026-04-16 20:08:19 -07:00
Khoa Pham	f836658077	[Spec][Ngram] 4/N: Remove `max_match_window_size` and `min_match_window_size`, matching all suffixes of the Trie (#21225)	2026-04-01 22:09:46 -07:00
David Cheung	ed427e1299	Migrate all callers from /get_server_info to /server_info (#21463)	2026-04-01 21:17:50 -07:00
Noa Neria	8d9145d97e	Direct model loading from object storage with Runai Model Streamer (#17948) Signed-off-by: Noa Neria <noa@run.ai>	2026-04-01 18:41:22 -07:00
Aishwarya Ramasethu	c32ee48886	MFU metrics in Prometheus (#19395)	2026-03-29 23:40:06 -07:00
Baizhou Zhang	edd4d54023	[Clean] Remove deprecated environs (#21536)	2026-03-28 00:35:44 -07:00
kpham-sgl	bc4aaab6a1	[Spec][Ngram] 2/N: Rename branch length to max trie depth (#21181) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 23:35:25 -07:00
kpham-sgl	6d160b42bb	[Spec][Ngram] 1/N: Reference based Speculative Decoding refactor (#20393)	2026-03-22 00:55:10 -07:00
Kangyan-Zhou	3d8fc9a0ca	Revert "[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api" (#20792)	2026-03-17 11:59:02 -07:00
Shu Wang	d35fea1b2b	[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api (#12787)	2026-03-17 10:02:45 -07:00
Teng Ma	7c498a6538	[DOC] add documents for encoder global mm cache (#20636)	2026-03-15 16:44:21 -07:00
Liangsheng Yin	fc7f9c1de7	Rename --stream-output to --incremental-streaming-output (#20614)	2026-03-14 23:22:33 -07:00
Yoray Zack	9991debde3	[Feature] Integrate Elastic NIXL-EP into SGLang (#19248) Signed-off-by: Barak Biber <bbiber@nvidia.com> Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Itay Alroy <ialroy@nvidia.com> Co-authored-by: Barak Biber <bbiber@nvidia.com>	2026-03-11 17:37:43 +08:00
Ziang Li	76ee4bb98c	[FlashInfer v0.6.4] [RL] Integrate FlashInfer mxfp8 gemm, MoE, and routed MoE (#19537)	2026-03-10 15:37:57 -07:00
Brayden Zhong	e2af840c3d	Various SM120 improvements (#19721)	2026-03-03 16:46:13 -08:00
Yuwei An	0abb9f4176	Piecewise Cuda Graph Docs (#19738) Signed-off-by: yuweia <ayw.sirius19@gmail.com> Co-authored-by: Wenyao Gao <wgao11@u.rochester.edu>	2026-03-03 11:51:17 +08:00
Shangming Cai	0a6678bf3a	[PD] Remove unused server args for disaggregation (#19618) Signed-off-by: Shangming Cai <csmthu@gmail.com>	2026-03-02 11:38:50 +08:00
ympcMark	43fade5f69	[4/N] (Elastic EP) Back up Expert Weights in DRAM (#17374) Co-authored-by: UNIDY2002 <unidy2002@outlook.com>	2026-02-27 15:59:13 +08:00
billishyahao	60eeef7370	[AMD][with CI Fix] support two batch overlapping for mori ep (#19216) Co-authored-by: Duyi-Wang <duyi.wang@amd.com> Co-authored-by: kkHuang-amd <wunhuang@amd.com> Co-authored-by: Feiyue Zhai <feiyue.zhai@amd.com> Co-authored-by: HAI <hixiao@gmail.com>	2026-02-25 02:14:08 -08:00
Hubert Lu	17b0affbdf	[AMD] Support --enable-aiter-allreduce-fusion on AMD GPUs (#13747) Co-authored-by: yctseng0211 <yctseng@amd.com>	2026-02-24 23:11:55 -08:00
Baizhou Zhang	43f83525c0	Revert "[AMD] support two batch overlapping for mori ep #17953" (#19161)	2026-02-23 01:19:23 +08:00
billishyahao	fbb6098487	[AMD] support two batch overlapping for mori ep (#17953) Co-authored-by: kkHuang-amd <wunhuang@amd.com> Co-authored-by: Feiyue Zhai <feiyue.zhai@amd.com> Co-authored-by: Duyi-Wang <duyi.wang@amd.com> Co-authored-by: HAI <hixiao@gmail.com>	2026-02-20 08:45:55 -08:00
Mohammad Miadh Angkad	2f592c3b18	[Doc] Add `flashinfer_deepgemm` to `--fp8-gemm-backend` (#18982)	2026-02-18 14:45:47 -05:00
Estrella-xx	1b3513a7e4	refactor FAKE transfer backend and remove --disaggregation-decode-enable-fake-auto parameter (#18345)	2026-02-16 17:27:02 +03:00
Rain Jiang	0ffd0a3995	Nsa trtllm mla sparse fp8 support with Deepseek v3.2 NVFP4 (#18389)	2026-02-16 09:29:54 +08:00
dongjiyingdjy	8b4c364960	refactor context parallel state (#17213) Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2026-02-13 23:18:17 +08:00
danielafrimi	e422bcaed8	[Mamba] Add float16 support for SSM cache dtype (#18444)	2026-02-12 11:27:47 +08:00
qianyue76	f06ab17a73	[diffusion] docs: consolidate diffusion documentation into docs (#18095) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: JiaxinD <djx2048@gmail.com>	2026-02-11 16:55:07 -08:00
Baizhou Zhang	947927bdb5	[V3.2] Change default CP token split method to `--round-robin-split` (#18613)	2026-02-11 20:14:35 +08:00
Mohammad Miadh Angkad	fddef76619	[Doc] Fix outdated `--fp4-gemm-backend` documentation (#18350)	2026-02-07 20:42:47 +08:00
rinbaro	de6a03260f	[docs] fix misspellings & typos (#18276)	2026-02-05 03:35:29 +00:00
Viacheslav	74f716dbd7	Gigachat 3 tool parser and tests (#14765)	2026-02-02 22:28:34 -08:00
Ziang Li	3c9cc44ff5	Add mxfp8 support for online quantization, Triton dense linear, and CUTLASS MoE (#17449)	2026-01-29 21:33:57 +08:00
Baizhou Zhang	832c756549	[Doc] Tiny update description on torch compile (#17819)	2026-01-27 18:59:04 +08:00
Yi Zhong	08fcda2f63	add the fa4 mm backend and varlen func (#13539) Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>	2026-01-23 23:12:06 +08:00
hxie	13f88045b3	configuration file support and nixl integration augmentation for hicache-storage-backend-extra-config (#16602)	2026-01-22 14:31:48 -08:00
zijiexia	4ecd9afde9	[Docs] Rename SGLang Router to SGLang Model Gateway (#17436)	2026-01-20 12:31:10 -08:00
b8zhong	f374623fa9	[Refactor] Set `fp4-gemm-backend=auto` on SM100 and rename `fp4-gemm-backend` with `flashinfer_` prefix (#17309)	2026-01-19 20:09:07 +08:00
Glen Liu	ad1b4e4728	[Feature] overlap LoRA weight loading with compute (#15512)	2026-01-19 10:43:17 +08:00
Xinyuan Tong	2069050d3f	fix: Handle multiple named chat templates in HuggingFace tokenizers (#17236) Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>	2026-01-18 17:20:04 +08:00
shuwenn	8ec160ed46	feature: support uvicorn access log filter(disable logging /metrics) (#15513)	2026-01-15 20:00:06 -08:00
shuwenn	9227d9f60c	[Docs] sort and update `server_arguments.md` (#17163)	2026-01-15 12:07:18 -05:00
Glen Liu	6b065298b5	[Docs] add routing-key to schedule-policy in docs (#17101)	2026-01-14 22:22:07 -05:00
shuwenn	cd33694585	feat: add --admin-api-key for finer-grained endpoint auth (#15908) Co-authored-by: Simo Lin <linsimo.mark@gmail.com>	2026-01-13 20:21:55 -08:00
Ratish P	c0248d6f37	[dpc]: unify DP controller load balancing and simplify dispatch logic (#16258)	2026-01-11 12:38:03 +08:00
Huapeng Zhou	078270473a	[Doc] Default lora backend: csgmv (#16444)	2026-01-05 12:45:49 +08:00
Yongfei Xu	0d244116d2	[DeepSeek v3.2] opt Context Parallelism: support fused moe, multi batch and fp8 kvcache (#13959)	2026-01-02 23:49:14 +08:00
Huaixin Chang	c1dfbc777b	deprecate prefill-round-robin-balance (#16195) Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com> Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>	2025-12-31 22:25:33 +08:00
Mufeez Amjad	cbff7ad985	dp-attention: add follow_bootstrap_room + auto load-balance; drop decode_round_robin (#16110)	2025-12-30 22:33:06 +08:00

95 Commits