7489 Commits

Author SHA1 Message Date
sglang-bot
46bf19cdab chore: bump flashinfer version to 0.6.7.post2 (#22097)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2026-04-04 02:16:25 -07:00
narutolhy
24763256b9 [Speculative Decoding] Add FA4-based Spec Support (#21080)
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
2026-04-04 02:09:45 -07:00
Yuhao Yang
34d5765e2f [VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer (#22038) 2026-04-04 16:55:17 +08:00
Piotr Mazurek
b5e8c4b9e3 model: support LFM2-VL (Liquid Foundation Model 2 Vision-Language) (#21230)
Co-authored-by: Piotr Mazurek <piotr.mazurek@liquid.ai>
2026-04-04 16:36:04 +08:00
R0CKSTAR
1fb4bf3558 [diffusion] fix: validate attention backend for Ring Attention in USPAttention (#21828)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-04-04 16:24:02 +08:00
harrisonlimh
9fa12d605a Add dsv3 router gemm benchmark on blackwell (#17707) 2026-04-04 01:18:01 -07:00
Xiaoyu Zhang
82ea4906cf [diffusion] Default NVFP4 to CUTLASS and add all-model shape benchmarks (#22091) 2026-04-04 16:14:38 +08:00
Ethan (Yusheng) Su
ff8e47edf9 [5/n] Lora support cuda graph (#21647) 2026-04-04 00:31:46 -07:00
Douglas Yang
a94c3804c2 fix: mistral embedding regression fix (#21913) 2026-04-04 00:11:51 -07:00
Chi McIsaac
005e582d06 [diffusion] improve: norm fusion for z-image (#18762)
Signed-off-by: Chi McIsaac <chixie.mcisaac@gmail.com>
Co-authored-by: yihanc <yingluosanqian@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-04-04 14:01:01 +08:00
Qiaolin Yu
ef13031243 Tiny fix step3.5-flash launch crash (#22076) 2026-04-03 22:25:25 -07:00
Ziang Li
990c7590b8 [RL] Support mxfp8 DeepSeek V3 (#21280) 2026-04-03 21:57:45 -07:00
faceless void
de9859073f Add --stream-response-default-include-usage server flag (#16711) 2026-04-03 21:36:00 -07:00
CHEN Xi
31c9d8e885 [Diffusion] Fix weight scale swizzle and add large-M kernel config for FLUX.2-dev-NVFP4 (#22064) 2026-04-04 11:50:30 +08:00
Yilong Zhao
fe92f3563c dp: add profile req hook (#22083) 2026-04-03 20:47:09 -07:00
Yuxuan Zhang
b7ae3b5a9a GLM-4.7 and GLM-4.7-Flash Loading and import format (#21851) 2026-04-03 20:44:08 -07:00
Prozac614
db3d4f4b76 [diffusion] model: support two stage pipeline of LTX-2 (#20707)
Co-authored-by: daiweitao <dwti614707404@163.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: GMI Xiao Jin <xiao.j@gmicloud.ai>
2026-04-04 09:37:28 +08:00
Liangsheng Yin
95cdbce34f [Test] Extract common PD server setup into base fixture (#22080) 2026-04-03 16:37:12 -07:00
Lawrence Wu
9593d434c4 fix: pause_generation should not populate running_batch on prefill nodes (#20273) 2026-04-03 16:16:06 -07:00
Sundara Raman Ramachandran
90e86800f4 [Score API] Implement EngineScoreMixin for scoring functionality and refactor Tok… (#21342) 2026-04-03 15:17:42 -07:00
Baizhou Zhang
ac1e437f6a Revert "[Feature] JIT activation and update skills (by codex)" (#22078) 2026-04-03 15:04:15 -07:00
Mohammad Miadh Angkad
8cb337c8ea [Bugfix] Temporarily skip TRTLLM attention on (G)B300 (SM103) to avoid high-concurrency hang (#21906)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-04-03 14:19:13 -07:00
Yz Xiao
1d7a53dd03 [Fix] XGrammarGrammarBackend reset to clear inherited cache (#22054) 2026-04-03 14:17:59 -07:00
sglang-bot
84118acf50 chore: bump sglang-kernel version to 0.4.1 (#22009)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2026-04-03 13:58:35 -07:00
Lianmin Zheng
eb407b80f3 [Kernel] Make FA3/FA4 imports lazy in FlashAttentionBackend (#22028) 2026-04-03 13:49:00 -07:00
Brayden Zhong
6aafe756b9 Revert "[Feature] NVFP4 Marlin fallback for non-Blackwell GPUs (SM75+… (#22047) 2026-04-03 13:12:30 -07:00
Shiyan Deng
0c9dc098e7 Fix DP attention worker port binding for IPv6 support (#21917)
Signed-off-by: Shiyan Deng <dsy842974287@meta.com>
2026-04-03 12:39:39 -07:00
Zhangheng
ed3435e37f [HiSparse]: Optimize server args checking-HiSparse is temporarily only available for DSA models. (#22065) 2026-04-04 02:23:56 +08:00
Mick
151f727163 [diffusion] fix: fix gated repo failing the generate cmd (#22040)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-04 00:43:11 +08:00
DarkSharpness
44e5d35703 [Feature][JIT Kernel] JIT activation and update skills (by codex) (#21766)
Co-authored-by: weiminc <tnwilly@gmail.com>
2026-04-03 23:28:54 +08:00
Mick
030fb1c4b1 refactor: replace mm_inputs dict with MultimodalProcessorOutput (#21738) 2026-04-03 23:26:37 +08:00
Ke Bao
9f409d0749 [CI] Adjust CI server launch timeout (#22045) 2026-04-03 22:38:07 +08:00
Xiaoyu Zhang
ee9d922f5a Revert "[Kernel] Fuse temperature + softmax in sampling for decode speedup" (#22046) 2026-04-03 21:32:08 +08:00
Kangyan-Zhou
56ac9c9932 [Fix] Add _MOE_TP to graph_capture for MoE models with ep>1 (#21907)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-04-03 02:33:16 -07:00
Khoa Pham
cd75d54fc5 [Bugfix] Fix CUDA graph replay issues in trtllm_mla draft_extend (#21987)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 01:45:13 -07:00
shuwenn
4f84ce5807 [CI] ci: add test_http_server_auth.py to CI (#21866) 2026-04-03 16:32:18 +08:00
Thomas Wang
7431db7392 [AMD] Enable FP8 KV cache and FP8 attention kernel for NSA on MI300/MI355 with TileLang backend (#21511) 2026-04-03 00:58:23 -07:00
Kelon
ad0516d9c1 [NPU] optimize glm4.7 (#19246) 2026-04-03 15:44:07 +08:00
Shangming Cai
d82097a0df [PD] Tiny register info field cleanup for mooncake backend (#22016) 2026-04-03 15:13:44 +08:00
Ricardo-M-L
24f52e66d3 fix: remove duplicate words in comments (#22007) 2026-04-03 00:05:39 -07:00
Yuzhen Zhou
6b876a7710 [ROCM][RL] Shuffle Weight In-Place to Preserve Parameter Attributes (#21825) 2026-04-02 23:43:55 -07:00
Zhangheng
4d097047f2 [PD]: Add support for HiSparse to directly transfer the cache from Prefill to Decode DRAM. (#21591)
Co-authored-by: Tingwei Huang <huangtingwei9988@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2026-04-02 23:06:12 -07:00
kk
5bcbc9757c [AMD] Resolve the performance degression when launch server with "--enable-aiter-allreduce-fusion" (#21947)
Co-authored-by: wunhuang <wunhuang@amd.com>
2026-04-02 22:10:24 -07:00
DarkSharpness
d1b7c3907d [Parallel State Refactor 2/n] Unify code path of AMD deterministic all reduce (#20871) 2026-04-03 12:33:17 +08:00
Baizhou Zhang
efa7b2d5d3 Revert "[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine)" (#22002) 2026-04-02 20:42:13 -07:00
lviy
5f0df1e2ad [Bugfix] Fix incorrect dp-attention parallel info in bench_one_batch (#21519) 2026-04-02 20:13:53 -07:00
Yuhao Yang
69e89a1fcc [VLM] Enable per-image MM splitting by default and remove MULTI_IMAGES modality (#21899) 2026-04-03 11:04:41 +08:00
narutolhy
8897ac58f0 [PP] qwen3 vl skip layer id for pp (#19135) 2026-04-03 10:51:53 +08:00
Mook
991f3aa5b3 [Feature] NVFP4 Marlin fallback for non-Blackwell GPUs (SM75+) (#19652) 2026-04-03 10:48:15 +08:00
Khoa Pham
2b5aed94f5 Remove maxItems=1 restriction when tool_choice is specified (#20208) 2026-04-03 02:35:24 +00:00