95 Commits

Author SHA1 Message Date
Lianmin Zheng
44e67c6835 Remove deprecated double sparsity feature (#23009) 2026-04-17 13:33:12 -07:00
Jan Bernlöhr
04a53955b9 feat: add coordinated checkpoint prefetch for network filesystem loading (#20843) 2026-04-16 20:08:19 -07:00
Khoa Pham
f836658077 [Spec][Ngram] 4/N: Remove max_match_window_size and min_match_window_size, matching all suffixes of the Trie (#21225) 2026-04-01 22:09:46 -07:00
David Cheung
ed427e1299 Migrate all callers from /get_server_info to /server_info (#21463) 2026-04-01 21:17:50 -07:00
Noa Neria
8d9145d97e Direct model loading from object storage with Runai Model Streamer (#17948)
Signed-off-by: Noa Neria <noa@run.ai>
2026-04-01 18:41:22 -07:00
Aishwarya Ramasethu
c32ee48886 MFU metrics in Prometheus (#19395) 2026-03-29 23:40:06 -07:00
Baizhou Zhang
edd4d54023 [Clean] Remove deprecated environs (#21536) 2026-03-28 00:35:44 -07:00
kpham-sgl
bc4aaab6a1 [Spec][Ngram] 2/N: Rename branch length to max trie depth (#21181)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 23:35:25 -07:00
kpham-sgl
6d160b42bb [Spec][Ngram] 1/N: Reference based Speculative Decoding refactor (#20393) 2026-03-22 00:55:10 -07:00
Kangyan-Zhou
3d8fc9a0ca Revert "[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api" (#20792) 2026-03-17 11:59:02 -07:00
Shu Wang
d35fea1b2b [Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api (#12787) 2026-03-17 10:02:45 -07:00
Teng Ma
7c498a6538 [DOC] add documents for encoder global mm cache (#20636) 2026-03-15 16:44:21 -07:00
Liangsheng Yin
fc7f9c1de7 Rename --stream-output to --incremental-streaming-output (#20614) 2026-03-14 23:22:33 -07:00
Yoray Zack
9991debde3 [Feature] Integrate Elastic NIXL-EP into SGLang (#19248)
Signed-off-by: Barak Biber <bbiber@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Barak Biber <bbiber@nvidia.com>
2026-03-11 17:37:43 +08:00
Ziang Li
76ee4bb98c [FlashInfer v0.6.4] [RL] Integrate FlashInfer mxfp8 gemm, MoE, and routed MoE (#19537) 2026-03-10 15:37:57 -07:00
Brayden Zhong
e2af840c3d Various SM120 improvements (#19721) 2026-03-03 16:46:13 -08:00
Yuwei An
0abb9f4176 Piecewise Cuda Graph Docs (#19738)
Signed-off-by: yuweia <ayw.sirius19@gmail.com>
Co-authored-by: Wenyao Gao <wgao11@u.rochester.edu>
2026-03-03 11:51:17 +08:00
Shangming Cai
0a6678bf3a [PD] Remove unused server args for disaggregation (#19618)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-03-02 11:38:50 +08:00
ympcMark
43fade5f69 [4/N] (Elastic EP) Back up Expert Weights in DRAM (#17374)
Co-authored-by: UNIDY2002 <unidy2002@outlook.com>
2026-02-27 15:59:13 +08:00
billishyahao
60eeef7370 [AMD][with CI Fix] support two batch overlapping for mori ep (#19216)
Co-authored-by: Duyi-Wang <duyi.wang@amd.com>
Co-authored-by: kkHuang-amd <wunhuang@amd.com>
Co-authored-by: Feiyue Zhai <feiyue.zhai@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
2026-02-25 02:14:08 -08:00
Hubert Lu
17b0affbdf [AMD] Support --enable-aiter-allreduce-fusion on AMD GPUs (#13747)
Co-authored-by: yctseng0211 <yctseng@amd.com>
2026-02-24 23:11:55 -08:00
Baizhou Zhang
43f83525c0 Revert "[AMD] support two batch overlapping for mori ep #17953" (#19161) 2026-02-23 01:19:23 +08:00
billishyahao
fbb6098487 [AMD] support two batch overlapping for mori ep (#17953)
Co-authored-by: kkHuang-amd <wunhuang@amd.com>
Co-authored-by: Feiyue Zhai <feiyue.zhai@amd.com>
Co-authored-by: Duyi-Wang <duyi.wang@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
2026-02-20 08:45:55 -08:00
Mohammad Miadh Angkad
2f592c3b18 [Doc] Add flashinfer_deepgemm to --fp8-gemm-backend (#18982) 2026-02-18 14:45:47 -05:00
Estrella-xx
1b3513a7e4 refactor FAKE transfer backend and remove --disaggregation-decode-enable-fake-auto parameter (#18345) 2026-02-16 17:27:02 +03:00
Rain Jiang
0ffd0a3995 Nsa trtllm mla sparse fp8 support with Deepseek v3.2 NVFP4 (#18389) 2026-02-16 09:29:54 +08:00
dongjiyingdjy
8b4c364960 refactor context parallel state (#17213)
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2026-02-13 23:18:17 +08:00
danielafrimi
e422bcaed8 [Mamba] Add float16 support for SSM cache dtype (#18444) 2026-02-12 11:27:47 +08:00
qianyue76
f06ab17a73 [diffusion] docs: consolidate diffusion documentation into docs (#18095)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: JiaxinD <djx2048@gmail.com>
2026-02-11 16:55:07 -08:00
Baizhou Zhang
947927bdb5 [V3.2] Change default CP token split method to --round-robin-split (#18613) 2026-02-11 20:14:35 +08:00
Mohammad Miadh Angkad
fddef76619 [Doc] Fix outdated --fp4-gemm-backend documentation (#18350) 2026-02-07 20:42:47 +08:00
rinbaro
de6a03260f [docs] fix misspellings & typos (#18276) 2026-02-05 03:35:29 +00:00
Viacheslav
74f716dbd7 Gigachat 3 tool parser and tests (#14765) 2026-02-02 22:28:34 -08:00
Ziang Li
3c9cc44ff5 Add mxfp8 support for online quantization, Triton dense linear, and CUTLASS MoE (#17449) 2026-01-29 21:33:57 +08:00
Baizhou Zhang
832c756549 [Doc] Tiny update description on torch compile (#17819) 2026-01-27 18:59:04 +08:00
Yi Zhong
08fcda2f63 add the fa4 mm backend and varlen func (#13539)
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2026-01-23 23:12:06 +08:00
hxie
13f88045b3 configuration file support and nixl integration augmentation for hicache-storage-backend-extra-config (#16602) 2026-01-22 14:31:48 -08:00
zijiexia
4ecd9afde9 [Docs] Rename SGLang Router to SGLang Model Gateway (#17436) 2026-01-20 12:31:10 -08:00
b8zhong
f374623fa9 [Refactor] Set fp4-gemm-backend=auto on SM100 and rename fp4-gemm-backend with flashinfer_ prefix (#17309) 2026-01-19 20:09:07 +08:00
Glen Liu
ad1b4e4728 [Feature] overlap LoRA weight loading with compute (#15512) 2026-01-19 10:43:17 +08:00
Xinyuan Tong
2069050d3f fix: Handle multiple named chat templates in HuggingFace tokenizers (#17236)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2026-01-18 17:20:04 +08:00
shuwenn
8ec160ed46 feature: support uvicorn access log filter(disable logging /metrics) (#15513) 2026-01-15 20:00:06 -08:00
shuwenn
9227d9f60c [Docs] sort and update server_arguments.md (#17163) 2026-01-15 12:07:18 -05:00
Glen Liu
6b065298b5 [Docs] add routing-key to schedule-policy in docs (#17101) 2026-01-14 22:22:07 -05:00
shuwenn
cd33694585 feat: add --admin-api-key for finer-grained endpoint auth (#15908)
Co-authored-by: Simo Lin <linsimo.mark@gmail.com>
2026-01-13 20:21:55 -08:00
Ratish P
c0248d6f37 [dpc]: unify DP controller load balancing and simplify dispatch logic (#16258) 2026-01-11 12:38:03 +08:00
Huapeng Zhou
078270473a [Doc] Default lora backend: csgmv (#16444) 2026-01-05 12:45:49 +08:00
Yongfei Xu
0d244116d2 [DeepSeek v3.2] opt Context Parallelism: support fused moe, multi batch and fp8 kvcache (#13959) 2026-01-02 23:49:14 +08:00
Huaixin Chang
c1dfbc777b deprecate prefill-round-robin-balance (#16195)
Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
2025-12-31 22:25:33 +08:00
Mufeez Amjad
cbff7ad985 dp-attention: add follow_bootstrap_room + auto load-balance; drop decode_round_robin (#16110) 2025-12-30 22:33:06 +08:00