Commit Graph

187 Commits

Author SHA1 Message Date
billishyahao
fbb6098487 [AMD] support two batch overlapping for mori ep (#17953)
Co-authored-by: kkHuang-amd <wunhuang@amd.com>
Co-authored-by: Feiyue Zhai <feiyue.zhai@amd.com>
Co-authored-by: Duyi-Wang <duyi.wang@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
2026-02-20 08:45:55 -08:00
Mohammad Miadh Angkad
2f592c3b18 [Doc] Add flashinfer_deepgemm to --fp8-gemm-backend (#18982) 2026-02-18 14:45:47 -05:00
Mengyang Liu
4f980f6f23 [Feature] Implement update_weights_from_disk for SGLang-D (Diffusion … (#18306)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2026-02-18 11:24:07 -08:00
Estrella-xx
1b3513a7e4 refactor FAKE transfer backend and remove --disaggregation-decode-enable-fake-auto parameter (#18345) 2026-02-16 17:27:02 +03:00
Rain Jiang
0ffd0a3995 Nsa trtllm mla sparse fp8 support with Deepseek v3.2 NVFP4 (#18389) 2026-02-16 09:29:54 +08:00
SoluMilken
07a24f1a38 update pre-commit config (#18860) 2026-02-16 00:18:31 +08:00
shuwenn
4cf4f0859f [Doc] Convert the speculative decoding notebook to markdown (#18395) 2026-02-14 18:18:56 -08:00
shuwenn
3299c4f9c1 [CI] feat: add early exit to wait_for_server when process dies (#18602) 2026-02-13 16:46:09 -08:00
dongjiyingdjy
8b4c364960 refactor context parallel state (#17213)
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2026-02-13 23:18:17 +08:00
danielafrimi
e422bcaed8 [Mamba] Add float16 support for SSM cache dtype (#18444) 2026-02-12 11:27:47 +08:00
qianyue76
f06ab17a73 [diffusion] docs: consolidate diffusion documentation into docs (#18095)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: JiaxinD <djx2048@gmail.com>
2026-02-11 16:55:07 -08:00
Baizhou Zhang
947927bdb5 [V3.2] Change default CP token split method to --round-robin-split (#18613) 2026-02-11 20:14:35 +08:00
赵晨阳
a2c38f7796 Enhance SMG guide with RL rollout systems benefits (#18588) 2026-02-10 20:20:45 -08:00
AlexZhao
3167bcc01c [Doc] Comprehensive Guide: Navigating DP, DPA, and SMG Best Practices (#18096)
Co-authored-by: 赵海源 <zhaohaiyuan@xiaohongshu.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
2026-02-10 18:31:28 -08:00
Zack Yu
54589a2f2d docs: expand and update modelopt documentation (#18479)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-09 23:09:52 +00:00
Mohammad Miadh Angkad
fddef76619 [Doc] Fix outdated --fp4-gemm-backend documentation (#18350) 2026-02-07 20:42:47 +08:00
shuwenn
ef1d0ea885 [Doc] add a summary section for spec decode document (#18323) 2026-02-05 16:34:31 -05:00
shuwenn
8b21dd4b77 [Doc] refine spec decode docs for SpecV2/STANDALONE/NGRAM (#18321) 2026-02-05 15:12:33 -05:00
rinbaro
de6a03260f [docs] fix misspellings & typos (#18276) 2026-02-05 03:35:29 +00:00
Teng Ma
c8212b9fac [PD] doc: Document SGLANG_MOONCAKE_CUSTOM_MEM_POOL and supported values (#18259)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-02-05 03:03:33 +00:00
Viacheslav
74f716dbd7 Gigachat 3 tool parser and tests (#14765) 2026-02-02 22:28:34 -08:00
husf
c52578c7fd [docs][NPU] Update Expert Parallelism docs for Ascend NPU (#17940) 2026-01-30 23:42:11 -05:00
Ziang Li
3c9cc44ff5 Add mxfp8 support for online quantization, Triton dense linear, and CUTLASS MoE (#17449) 2026-01-29 21:33:57 +08:00
kk
f1384f5293 Integrate mori backend for EP a2a data communication (#17012)
Co-authored-by: Duyi-Wang <duyi.wang@amd.com>
Co-authored-by: billishyahao <bill.he@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
2026-01-28 19:07:34 -08:00
Baizhou Zhang
832c756549 [Doc] Tiny update description on torch compile (#17819) 2026-01-27 18:59:04 +08:00
shuwenn
fd3b179ffd [HiCache][HA 1/N] Support HiCache storage runtime attach/detach (#15892) 2026-01-26 19:33:19 -08:00
zijiexia
dd97e1fe38 [Docs] Add RL documentation (#17663)
Co-authored-by: JD <jaedon.guo@gmail.com>
2026-01-26 12:16:54 -08:00
zackyoray
d275d47973 [NIXL] Add custom NIXL backend selection for KVManager (#17146)
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
2026-01-26 14:35:38 +08:00
Trevor Morris
2c2c4e446b [NVIDIA] Add flashinfer all-to-all MOE dispatcher (#14668) 2026-01-24 22:59:55 +08:00
Glen Liu
a6280b2a23 add documentation example for LoRA overlap loading and cleanup unused function (#17464) 2026-01-24 15:33:16 +08:00
Yi Zhong
08fcda2f63 add the fa4 mm backend and varlen func (#13539)
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2026-01-23 23:12:06 +08:00
hxie
13f88045b3 configuration file support and nixl integration augmentation for hicache-storage-backend-extra-config (#16602) 2026-01-22 14:31:48 -08:00
zijiexia
4ecd9afde9 [Docs] Rename SGLang Router to SGLang Model Gateway (#17436) 2026-01-20 12:31:10 -08:00
Ruihuan He
eb38d64413 [Docs] Explain CUDA attention backend choices and aiter FP8 KV cache support (#17428) 2026-01-20 11:10:36 -08:00
Shangming Cai
23d765d1f2 [Doc] Update pipeline parallelism documentation (#17414) 2026-01-20 20:10:56 +08:00
b8zhong
f374623fa9 [Refactor] Set fp4-gemm-backend=auto on SM100 and rename fp4-gemm-backend with flashinfer_ prefix (#17309) 2026-01-19 20:09:07 +08:00
Glen Liu
ad1b4e4728 [Feature] overlap LoRA weight loading with compute (#15512) 2026-01-19 10:43:17 +08:00
Xinyuan Tong
2069050d3f fix: Handle multiple named chat templates in HuggingFace tokenizers (#17236)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2026-01-18 17:20:04 +08:00
zijiexia
166396ca4c [Docs] minor update on ep docs (#17242) 2026-01-16 18:20:47 -08:00
shuwenn
8ec160ed46 feature: support uvicorn access log filter(disable logging /metrics) (#15513) 2026-01-15 20:00:06 -08:00
Yi Zhong
d1110e1c3e docs only add kimi k2 thinking and kimi linear (#15789) 2026-01-15 12:09:52 -05:00
shuwenn
9227d9f60c [Docs] sort and update server_arguments.md (#17163) 2026-01-15 12:07:18 -05:00
Glen Liu
6b065298b5 [Docs] add routing-key to schedule-policy in docs (#17101) 2026-01-14 22:22:07 -05:00
fxmarty-amd
5af84c8af5 [AMD][Quantization] Add int4fp8_moe online quantization on ROCm (#7392)
Co-authored-by: Dehua Tang <dehtang@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: YC Tseng <yctseng@amd.com>
2026-01-14 01:44:40 -08:00
shuwenn
cd33694585 feat: add --admin-api-key for finer-grained endpoint auth (#15908)
Co-authored-by: Simo Lin <linsimo.mark@gmail.com>
2026-01-13 20:21:55 -08:00
James
ae0baefb94 [NPU] upgrade npu mf_apater plugin (#15853) 2026-01-13 09:02:10 +08:00
Wenyi Xu
3c16c58619 [model-gateway] Add Redis support as a history backend (#16300) 2026-01-11 01:03:00 -08:00
Ratish P
c0248d6f37 [dpc]: unify DP controller load balancing and simplify dispatch logic (#16258) 2026-01-11 12:38:03 +08:00
Shangming Cai
973116e6bb [Doc] Optimize pipeline parallelism doc (#16630)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-01-07 14:52:42 +08:00
Huapeng Zhou
078270473a [Doc] Default lora backend: csgmv (#16444) 2026-01-05 12:45:49 +08:00