240 Commits

Author SHA1 Message Date
shuwenn
b65799cf83 [SPEC][1/N] feat: add adaptive speculative_num_steps for EAGLE topk=1 (#21599)
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2026-04-20 14:25:04 -07:00
Lianmin Zheng
44e67c6835 Remove deprecated double sparsity feature (#23009) 2026-04-17 13:33:12 -07:00
Jan Bernlöhr
04a53955b9 feat: add coordinated checkpoint prefetch for network filesystem loading (#20843) 2026-04-16 20:08:19 -07:00
Liangsheng Yin
db7a751d48 refactor: extract FanOutCommunicator and use declarative spec table (#22967) 2026-04-16 15:37:19 -07:00
hhwxw
2480cc2a16 docs: fix incorrect default max-payload-size in gateway config reference (#22923) 2026-04-16 13:25:27 +08:00
cctry
f855a0bde6 Introduce CUDA graph debug mode with breakable CUDA graph (#19102)
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Cheng Wan <chwan@rice.edu>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-11 00:36:56 -07:00
Zhangheng
5ba7d4e523 [HiSparse]: Update HiSparse's user-guide (#22499)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
2026-04-10 15:06:43 +08:00
Zhangheng
3d3a32c0b9 [HiSparse]: Add readme docs for HiSparse Feature (#22238) 2026-04-07 00:39:24 -07:00
Khoa Pham
12272b6791 [Spec][Ngram] 6/N: Load an external corpus and construct a Suffix Automaton (#21425) 2026-04-06 00:11:14 -07:00
YAMY
dc125afffb Add staging buffer CI test and documentation for heterogeneous TP (#21921)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-04-06 14:00:20 +08:00
narutolhy
24763256b9 [Speculative Decoding] Add FA4-based Spec Support (#21080)
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
2026-04-04 02:09:45 -07:00
Liangsheng Yin
f25bf86065 Fix ngram doc for speculative_num_draft_tokens default (#21910) 2026-04-01 22:18:24 -07:00
Khoa Pham
f836658077 [Spec][Ngram] 4/N: Remove max_match_window_size and min_match_window_size, matching all suffixes of the Trie (#21225) 2026-04-01 22:09:46 -07:00
David Cheung
ed427e1299 Migrate all callers from /get_server_info to /server_info (#21463) 2026-04-01 21:17:50 -07:00
Noa Neria
8d9145d97e Direct model loading from object storage with Runai Model Streamer (#17948)
Signed-off-by: Noa Neria <noa@run.ai>
2026-04-01 18:41:22 -07:00
Brayden Zhong
6a9b09847c CUTLASS NVFP4 GEMM improvement of SM120 (#21314) 2026-04-01 09:04:34 +08:00
Aishwarya Ramasethu
c32ee48886 MFU metrics in Prometheus (#19395) 2026-03-29 23:40:06 -07:00
Артем Савкин
27071e0a43 [NPU] Update quantization&CI documentation (#21100)
Co-authored-by: Tamir Baydasov <41994229+TamirBaydasov@users.noreply.github.com>
2026-03-28 21:42:21 +03:00
Baizhou Zhang
edd4d54023 [Clean] Remove deprecated environs (#21536) 2026-03-28 00:35:44 -07:00
Jiaxin(Jackson) Deng
c4db64c16b Add Lychee Doc Links Check to Local and CI (#19742)
Co-authored-by: Zijie Xia <zijie_xia@icloud.com>
Co-authored-by: Zijie Xia <zijiexia@users.noreply.github.com>
Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
2026-03-24 13:48:26 -07:00
kpham-sgl
bc4aaab6a1 [Spec][Ngram] 2/N: Rename branch length to max trie depth (#21181)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 23:35:25 -07:00
kpham-sgl
6d160b42bb [Spec][Ngram] 1/N: Reference based Speculative Decoding refactor (#20393) 2026-03-22 00:55:10 -07:00
Xinyuan Tong
d1e95af282 Upgrade transformers==5.3.0 (#17784)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Co-authored-by: Alison Shao <alisonshao@mac.lan>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-18 13:50:43 -07:00
ishandhanani
8f0f36c64b [1/2] Add ModelExpress coordination for remote instance weight loading - matching TP (#19920)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>
2026-03-18 13:38:32 -07:00
Kangyan-Zhou
3d8fc9a0ca Revert "[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api" (#20792) 2026-03-17 11:59:02 -07:00
Shu Wang
d35fea1b2b [Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api (#12787) 2026-03-17 10:02:45 -07:00
Teng Ma
7c498a6538 [DOC] add documents for encoder global mm cache (#20636) 2026-03-15 16:44:21 -07:00
Mook
23c191afb6 fix(docs): correct quantization documentation (#20301) (#20619) 2026-03-15 12:33:12 -04:00
Liangsheng Yin
fc7f9c1de7 Rename --stream-output to --incremental-streaming-output (#20614) 2026-03-14 23:22:33 -07:00
Matt Van Horn
d093e70067 [Doc] Add DSA/NSA attention backend to support matrix (#20326)
Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 13:40:35 -04:00
Yoray Zack
9991debde3 [Feature] Integrate Elastic NIXL-EP into SGLang (#19248)
Signed-off-by: Barak Biber <bbiber@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Barak Biber <bbiber@nvidia.com>
2026-03-11 17:37:43 +08:00
Liangsheng Yin
50953aea8d [Scheduler] Unify idle checks into is_fully_idle() and fix weight update test (#20296) 2026-03-10 17:50:23 -07:00
Ziang Li
76ee4bb98c [FlashInfer v0.6.4] [RL] Integrate FlashInfer mxfp8 gemm, MoE, and routed MoE (#19537) 2026-03-10 15:37:57 -07:00
shuwenn
5a11ae19c1 [CI] fix: notebook ci often OOM (#20199) 2026-03-09 22:32:41 -07:00
Brayden Zhong
591e61245a [Doc] Add smal table for GEMM backends (#20213) 2026-03-09 22:19:57 -07:00
YEJIN KIM
0fd9a57d80 [Doc] Verify and Modify some attention backend specs (#20210) 2026-03-09 23:05:52 +00:00
shuwenn
7bd3dd9270 fix: image URL in notebook to use raw.githubusercontent.com (#20100) 2026-03-07 13:28:20 -08:00
Bruce Changlong Xu
feda2b11c4 [AMD] Add AWQ AMD CI coverage and quantization platform compatibility docs (#19550) 2026-03-04 19:50:55 -08:00
Brayden Zhong
e2af840c3d Various SM120 improvements (#19721) 2026-03-03 16:46:13 -08:00
Sam (Kesen Li)
5b2e2750b5 Enable XQA for SM90 and SM120 (#17115)
Co-authored-by: Xiaowei Wang <100599594+xiaoweiw-nv@users.noreply.github.com>
2026-03-03 14:09:44 -08:00
zwang86
d6ac5f23cc [Docs] Add GDN attention backends matrix documentation (#19755)
Co-authored-by: Zeyu Wang <zeyu.wang@yahooinc.com>
2026-03-03 13:00:34 -08:00
Jasonzhang517
d939e26585 [model gateway][0/N] router EPD support: add encoder grpc server backend support (#16552)
Co-authored-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
Co-authored-by: Zongyao Chen <solar1s@163.com>
2026-03-03 19:38:15 +08:00
Yuwei An
0abb9f4176 Piecewise Cuda Graph Docs (#19738)
Signed-off-by: yuweia <ayw.sirius19@gmail.com>
Co-authored-by: Wenyao Gao <wgao11@u.rochester.edu>
2026-03-03 11:51:17 +08:00
shuwenn
bdffb027a8 [CI] fix: handle missing repo in lora notebook (#19700) 2026-03-02 10:27:32 -08:00
Shangming Cai
0a6678bf3a [PD] Remove unused server args for disaggregation (#19618)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-03-02 11:38:50 +08:00
shuwenn
e3e71f275a docs: refactor speculative decoding doc (#19186) 2026-03-01 22:03:20 -05:00
zwang86
f51ddba131 feat: add FA4 SM90 paged KV decode support & update attention docs (#18442)
Co-authored-by: Zeyu Wang <zeyu.wang@yahooinc.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2026-03-02 09:12:19 +08:00
ympcMark
43fade5f69 [4/N] (Elastic EP) Back up Expert Weights in DRAM (#17374)
Co-authored-by: UNIDY2002 <unidy2002@outlook.com>
2026-02-27 15:59:13 +08:00
billishyahao
60eeef7370 [AMD][with CI Fix] support two batch overlapping for mori ep (#19216)
Co-authored-by: Duyi-Wang <duyi.wang@amd.com>
Co-authored-by: kkHuang-amd <wunhuang@amd.com>
Co-authored-by: Feiyue Zhai <feiyue.zhai@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
2026-02-25 02:14:08 -08:00
huangtingwei
d40cb2f725 [HiCache] Support heterogeneous tp for hicache storage (#18541)
Co-authored-by: hzh0425 <hzh0425@apache.org>
2026-02-25 00:13:57 -08:00