sglang-bot
46bf19cdab
chore: bump flashinfer version to 0.6.7.post2 (#22097)
...
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2026-04-04 02:16:25 -07:00
narutolhy
24763256b9
[Speculative Decoding] Add FA4-based Spec Support (#21080)
...
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
2026-04-04 02:09:45 -07:00
Yuhao Yang
34d5765e2f
[VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer (#22038)
2026-04-04 16:55:17 +08:00
Piotr Mazurek
b5e8c4b9e3
model: support LFM2-VL (Liquid Foundation Model 2 Vision-Language) (#21230)
...
Co-authored-by: Piotr Mazurek <piotr.mazurek@liquid.ai>
2026-04-04 16:36:04 +08:00
R0CKSTAR
1fb4bf3558
[diffusion] fix: validate attention backend for Ring Attention in USPAttention (#21828)
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-04-04 16:24:02 +08:00
harrisonlimh
9fa12d605a
Add dsv3 router gemm benchmark on blackwell (#17707)
2026-04-04 01:18:01 -07:00
Xiaoyu Zhang
82ea4906cf
[diffusion] Default NVFP4 to CUTLASS and add all-model shape benchmarks (#22091)
2026-04-04 16:14:38 +08:00
Ethan (Yusheng) Su
ff8e47edf9
[5/n] Lora support cuda graph (#21647)
2026-04-04 00:31:46 -07:00
Douglas Yang
a94c3804c2
fix: mistral embedding regression fix (#21913)
2026-04-04 00:11:51 -07:00
Chi McIsaac
005e582d06
[diffusion] improve: norm fusion for z-image (#18762)
...
Signed-off-by: Chi McIsaac <chixie.mcisaac@gmail.com>
Co-authored-by: yihanc <yingluosanqian@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-04-04 14:01:01 +08:00
Qiaolin Yu
ef13031243
Tiny fix step3.5-flash launch crash (#22076)
2026-04-03 22:25:25 -07:00
Ziang Li
990c7590b8
[RL] Support mxfp8 DeepSeek V3 (#21280)
2026-04-03 21:57:45 -07:00
faceless void
de9859073f
Add --stream-response-default-include-usage server flag (#16711)
2026-04-03 21:36:00 -07:00
CHEN Xi
31c9d8e885
[Diffusion] Fix weight scale swizzle and add large-M kernel config for FLUX.2-dev-NVFP4 (#22064)
2026-04-04 11:50:30 +08:00
Yilong Zhao
fe92f3563c
dp: add profile req hook (#22083)
2026-04-03 20:47:09 -07:00
Yuxuan Zhang
b7ae3b5a9a
GLM-4.7 and GLM-4.7-Flash Loading and import format (#21851)
2026-04-03 20:44:08 -07:00
Prozac614
db3d4f4b76
[diffusion] model: support two stage pipeline of LTX-2 (#20707)
...
Co-authored-by: daiweitao <dwti614707404@163.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: GMI Xiao Jin <xiao.j@gmicloud.ai>
2026-04-04 09:37:28 +08:00
Liangsheng Yin
95cdbce34f
[Test] Extract common PD server setup into base fixture (#22080)
2026-04-03 16:37:12 -07:00
Lawrence Wu
9593d434c4
fix: pause_generation should not populate running_batch on prefill nodes (#20273)
2026-04-03 16:16:06 -07:00
Sundara Raman Ramachandran
90e86800f4
[Score API] Implement EngineScoreMixin for scoring functionality and refactor Tok… (#21342)
2026-04-03 15:17:42 -07:00
Baizhou Zhang
ac1e437f6a
Revert "[Feature] JIT activation and update skills (by codex)" (#22078)
2026-04-03 15:04:15 -07:00
Mohammad Miadh Angkad
8cb337c8ea
[Bugfix] Temporarily skip TRTLLM attention on (G)B300 (SM103) to avoid high-concurrency hang (#21906)
...
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-04-03 14:19:13 -07:00
Yz Xiao
1d7a53dd03
[Fix] XGrammarGrammarBackend reset to clear inherited cache (#22054)
2026-04-03 14:17:59 -07:00
sglang-bot
84118acf50
chore: bump sglang-kernel version to 0.4.1 (#22009)
...
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2026-04-03 13:58:35 -07:00
Lianmin Zheng
eb407b80f3
[Kernel] Make FA3/FA4 imports lazy in FlashAttentionBackend (#22028)
2026-04-03 13:49:00 -07:00
Brayden Zhong
6aafe756b9
Revert "[Feature] NVFP4 Marlin fallback for non-Blackwell GPUs (SM75+… (#22047)
2026-04-03 13:12:30 -07:00
Shiyan Deng
0c9dc098e7
Fix DP attention worker port binding for IPv6 support (#21917)
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com>
2026-04-03 12:39:39 -07:00
Zhangheng
ed3435e37f
[HiSparse]: Optimize server args checking-HiSparse is temporarily only available for DSA models. (#22065)
2026-04-04 02:23:56 +08:00
Mick
151f727163
[diffusion] fix: fix gated repo failing the generate cmd (#22040)
...
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-04 00:43:11 +08:00
DarkSharpness
44e5d35703
[Feature][JIT Kernel] JIT activation and update skills (by codex) (#21766)
...
Co-authored-by: weiminc <tnwilly@gmail.com>
2026-04-03 23:28:54 +08:00
Mick
030fb1c4b1
refactor: replace mm_inputs dict with MultimodalProcessorOutput (#21738)
2026-04-03 23:26:37 +08:00
Ke Bao
9f409d0749
[CI] Adjust CI server launch timeout (#22045)
2026-04-03 22:38:07 +08:00
Xiaoyu Zhang
ee9d922f5a
Revert "[Kernel] Fuse temperature + softmax in sampling for decode speedup" (#22046)
2026-04-03 21:32:08 +08:00
Kangyan-Zhou
56ac9c9932
[Fix] Add _MOE_TP to graph_capture for MoE models with ep>1 (#21907)
...
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-04-03 02:33:16 -07:00
Khoa Pham
cd75d54fc5
[Bugfix] Fix CUDA graph replay issues in trtllm_mla draft_extend (#21987)
...
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 01:45:13 -07:00
shuwenn
4f84ce5807
[CI] ci: add test_http_server_auth.py to CI (#21866)
2026-04-03 16:32:18 +08:00
Thomas Wang
7431db7392
[AMD] Enable FP8 KV cache and FP8 attention kernel for NSA on MI300/MI355 with TileLang backend (#21511)
2026-04-03 00:58:23 -07:00
Kelon
ad0516d9c1
[NPU] optimize glm4.7 (#19246)
2026-04-03 15:44:07 +08:00
Shangming Cai
d82097a0df
[PD] Tiny register info field cleanup for mooncake backend (#22016)
2026-04-03 15:13:44 +08:00
Ricardo-M-L
24f52e66d3
fix: remove duplicate words in comments (#22007)
2026-04-03 00:05:39 -07:00
Yuzhen Zhou
6b876a7710
[ROCM][RL] Shuffle Weight In-Place to Preserve Parameter Attributes (#21825)
2026-04-02 23:43:55 -07:00
Zhangheng
4d097047f2
[PD]: Add support for HiSparse to directly transfer the cache from Prefill to Decode DRAM. (#21591)
...
Co-authored-by: Tingwei Huang <huangtingwei9988@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2026-04-02 23:06:12 -07:00
kk
5bcbc9757c
[AMD] Resolve the performance degression when launch server with "--enable-aiter-allreduce-fusion" (#21947)
...
Co-authored-by: wunhuang <wunhuang@amd.com>
2026-04-02 22:10:24 -07:00
DarkSharpness
d1b7c3907d
[Parallel State Refactor 2/n] Unify code path of AMD deterministic all reduce (#20871)
2026-04-03 12:33:17 +08:00
Baizhou Zhang
efa7b2d5d3
Revert "[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine)" (#22002)
2026-04-02 20:42:13 -07:00
lviy
5f0df1e2ad
[Bugfix] Fix incorrect dp-attention parallel info in bench_one_batch (#21519)
2026-04-02 20:13:53 -07:00
Yuhao Yang
69e89a1fcc
[VLM] Enable per-image MM splitting by default and remove MULTI_IMAGES modality (#21899)
2026-04-03 11:04:41 +08:00
narutolhy
8897ac58f0
[PP] qwen3 vl skip layer id for pp (#19135)
2026-04-03 10:51:53 +08:00
Mook
991f3aa5b3
[Feature] NVFP4 Marlin fallback for non-Blackwell GPUs (SM75+) (#19652)
2026-04-03 10:48:15 +08:00
Khoa Pham
2b5aed94f5
Remove maxItems=1 restriction when tool_choice is specified (#20208)
2026-04-03 02:35:24 +00:00