8148 Commits

Author SHA1 Message Date
maocheng23
431ca54334 [fix] /pause_generation and /continue_generation wrong for --tokenizer-worker-num > 1 (#24445)
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-05 16:27:58 -07:00
Liangsheng Yin
08d4c2072b move topk capturers to srt/state_capturer/ (#24450)
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Ziang Li <ziangli@umich.edu>
2026-05-05 15:54:01 -07:00
Liangsheng Yin
47a416fc62 add indexer-topk capture (V3.2 NSA + infra) (#24392) 2026-05-05 15:05:15 -07:00
Liangsheng Yin
c4c0376fcb consolidate routed-experts capturer onto reusable base (#24403) 2026-05-05 12:41:49 -07:00
Mick
d23ef408f7 [diffusion] fix: fix RowParallel LoRA merged forwarding (#24410) 2026-05-06 00:30:16 +08:00
Mick
cc54d8e8d0 [diffusion] chore: clean CUDA cache only at explicit release points (#24397) 2026-05-05 22:30:43 +08:00
Polisetty V R K Jyothendra Varma
fdfc46f3a5 [Intel GPU] Enable DeepSeek V3.2 inference on XPU (#24356)
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
2026-05-05 20:47:40 +08:00
Zhangheng
e299ec1bff [UnifiedRadixTree]: Fix flaky ci (#24421) 2026-05-05 20:22:19 +08:00
Khoa Pham
d22853480d Fix deterministic inference on models with SWAKVPool (#24395) 2026-05-05 20:20:46 +08:00
Bi Xue
9fb9a1cca6 [sgl] expose swa and mamba cache metrics (#24396) 2026-05-05 20:19:50 +08:00
Xiaoyu Zhang
67e8bd7a80 [codex] Optimize Helios fused norm modulation (#24059) 2026-05-05 19:28:37 +08:00
Xiaoyu Zhang
8c703f215e Add HunyuanVideo ModelOpt FP8 diffusion support (#23199) 2026-05-05 19:27:28 +08:00
billishyahao
80ccb6b93c [AMD] fix tbo specv2 seq_lens_cpu NoneType error (#24319) 2026-05-05 01:54:43 -07:00
Mick
177babcc38 [diffusion] optimize: fuse LTX2 split rotary embedding (#24411) 2026-05-05 16:07:40 +08:00
Hubert Lu
c2db19ffa4 [AMD] Enable EAGLE speculative decoding for Qwen3.5 FP8 and MXFP4 models with aiter's unified attention (#23146)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com>
2026-05-05 00:09:40 -07:00
Mick
04926e1d9f [diffusion] feat: cache encoder results for default negative prompt (#24304) 2026-05-05 11:56:01 +08:00
Mick
e483e60b72 [diffusion] CI: pin diffusion consistency GT revision (#24400) 2026-05-05 11:53:22 +08:00
Ethan (Yusheng) Su
2b769d37a4 (2/n - prefill optimize)perf(lora): remove GPU-CPU sync barrier (.item()) in MoE LoRA path and remove duplicate code (#24246)
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-04 18:11:28 -07:00
Mick
2f7d99b7f7 [diffusion] cli: support component attention backend overrides (#24320) 2026-05-05 08:39:27 +08:00
Xiaoyu Zhang
078f84d80d [SKILL] Add diffusion benchmark presets for edit and Hunyuan3D models (#24288)
Co-authored-by: BBuf Codex <bbuf-codex@users.noreply.github.com>
2026-05-05 08:18:12 +08:00
Ji Zeng
4b487ca98b [Fix] NGRAMWorker.update_weights_from_tensor — delegate to target worker (#24344) 2026-05-04 16:23:17 -07:00
Liangsheng Yin
6a62eabed6 consolidate NSA pool construction (#24389)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2026-05-04 16:04:31 -07:00
Tarushii Goel
d7c93e183b [sgl] reduce specdec cpu overhead (#23321) 2026-05-04 15:02:03 -07:00
Liangsheng Yin
4743cf6051 misc: add marlin to moe runner choices; drop dead env var doc (#24384)
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
2026-05-04 15:01:47 -07:00
Lianmin Zheng
29dd3a36c0 Refactor device timer installation and rename prefill prealloc to bootstrap (#24341) 2026-05-04 13:57:13 -07:00
Vladislav Nosivskoy
60a1dacd89 [HiCache] return cached_tokens_details in sglext for streaming responses (#22055)
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
2026-05-04 12:30:17 -07:00
Ke Bao
6dd7aebb36 Minor scheduler fixes (#24359) 2026-05-05 02:01:23 +08:00
Sam Shleifer
e6f252e9b8 Cache FlashInfer autotune configs (#24156) 2026-05-05 02:00:40 +08:00
Yuan Luo
e5c58eb9d6 [VLM] Optimize Gemma4 VLM with PCG and fuse RMSNorm + residual add + scalar (#24048)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-05-04 09:36:26 -07:00
Mick
1be3163011 [diffusion] fix: use direct all-to-all for USP collectives (#24366) 2026-05-05 00:08:48 +08:00
Xiaoyu Zhang
4b6d44641b [diffusion] chore: enable channels-last 3D VAE convs by default (#23200) 2026-05-04 22:59:31 +08:00
Zhangheng
05aed5e1d5 [UnifiedRadixTree]: Add KL accuracy CI for UnifiedTree with HiCache (#24346) 2026-05-04 20:18:10 +08:00
Liangsheng Yin
84f3b44916 [tiny] misc cleanups across configs, attention, jit_kernel (#24350) 2026-05-04 03:17:14 -07:00
Linzhang Li
952b3caf18 feat: use structural tags to enable strict tool calling and reasoning for more models (#21722)
Signed-off-by: Yuchuan <yuchuan.7streams@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Ubospica <ubospica@gmail.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2026-05-04 02:30:28 -07:00
Khoa Pham
ef2b1b6d89 Fix flashinfer workspace OOM (#24172) 2026-05-04 01:26:35 -07:00
Ke Bao
aea527afdc Fix swa chunk req deferred (#24318) 2026-05-04 14:52:15 +08:00
Liangsheng Yin
a91ae6af9e nextn subclass owns post_load_weights is_nextn (#24333) 2026-05-03 22:04:44 -07:00
Liangsheng Yin
1dd8f6d5ae dedup state_kv_args setup into helper (#24340) 2026-05-03 20:45:26 -07:00
Liangsheng Yin
91fa2340ed extract adjust_hybrid_swa_layers_for_pp (#24334) 2026-05-03 18:52:54 -07:00
Ethan (Yusheng) Su
b7fefc0e85 feat(lora): enable csgmv backend with virtual experts for MoE LoRA (#24007) 2026-05-03 18:44:17 -07:00
Mick
c611a3fb78 [diffusion] chore: disable VAE cpu offload by default (#24315) 2026-05-04 08:24:51 +08:00
Liangsheng Yin
00d620b77d introduce arg_groups/ with nemotron_h hook (#24328) 2026-05-03 16:28:11 -07:00
Liangsheng Yin
c3b6d20a80 Register deepseek_v32 alias instead of rewriting config.json (#24295) 2026-05-03 16:02:17 -07:00
Zhangheng
9a5450ad73 [PD]: Support incremental transfer for mooncake transfer engine (#24257)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-05-04 00:57:59 +08:00
Chi McIsaac
62265ca7fc [diffusion] feat: initial support for dynamic batching (#18764)
Signed-off-by: Chi McIsaac <chixie.mcisaac@gmail.com>
Co-authored-by: Junhao Liu <junhaoliu2023@gmail.com>
2026-05-04 00:44:42 +08:00
Xiaoyu Zhang
f2d1390909 [Diffusion] Add Qwen Image ModelOpt FP8 support (#23155)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-05-04 00:24:22 +08:00
Mick
5925572c95 [diffusion] CI: switch CI data references to sgl-project/ci-data (#24299) 2026-05-03 23:05:12 +08:00
Zhangheng
c0f5950636 [UnifiedRadixTree]: Support HiCache Framework for UnifiedRadixTree (#23316)
Co-authored-by: JINZ <1023553676@qq.com>
Co-authored-by: diemchai <diemchai@tencent.com>
2026-05-03 22:13:22 +08:00
GXIN
e37f46fcf7 [NPU] Fix Z-Image negative-branch rotary embeddings for CFG (#23538)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2026-05-03 16:18:26 +03:00
Zhangheng
44ca2d01fc [pd]: (Bug Fix) Incorrect out_cache_loc slicing in prepare_for_prebuilt (#24230)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-05-03 18:35:16 +08:00