7744 Commits

Author SHA1 Message Date
Mick
80718492dd [diffusion] CI: reset thresholds (#22854) 2026-04-15 21:11:00 +08:00
Zhangheng
0a5c9728a1 [HiSparse][BugFix]: Fix the memory leak issue during health checks. (#22882) 2026-04-15 19:49:54 +08:00
Liangsheng Yin
ce31934ca8 Streaming session: fix retract tail leak via _free_tail (#22862) 2026-04-15 01:44:27 -07:00
huangtingwei
3511c2deb4 [HiCache] Fix memory host free logic when share_indices_with_anchor enabled (#22767)
Co-authored-by: hzh0425 <hzh0425@apache.org>
2026-04-15 16:31:18 +08:00
Liangsheng Yin
aa78564e1a Refactor streaming session abort handling (#22790) 2026-04-15 00:13:05 -07:00
Hubert Lu
b2af34be54 [AMD] Optimize _append_shared_to_topk_output by a single fused Triton kernel for Qwen3.5 (#22844)
Co-authored-by: HaiShaw <hixiao@gmail.com>
2026-04-14 23:50:32 -07:00
Mick
e95c2e73bd [diffusion] CI: refactor diffusion ci and reduce redundancy (#22810) 2026-04-15 10:12:29 +08:00
Kangrui Du
47ac830c07 [diffusion] rl: support standalone rollout api, denoising environment backpass and sp-aligned log-prob for T2I post-training (#22604)
Co-authored-by: MikukuOvO <mikukuovo@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 10:10:38 +08:00
Lianmin Zheng
adb310b976 Cleanup server_args.py and minor code tidying (#22820) 2026-04-14 18:52:41 -07:00
Chen, Zhentao
ea05ea5abe [AMD] Enable share expert fusion with router experts for Qwen3.5 BF16 & FP8 (#20736)
Co-authored-by: Chen, Todd <zhenchen@amd.com>
Co-authored-by: jacky.cheng <yichiche@amd.com>
2026-04-14 18:52:36 -07:00
Piotr Mazurek
46c8a597ef [VLM] fix LFM2-VL offline inference and GPU JPEG decode (#22448) 2026-04-15 09:13:25 +08:00
Alex Nails
8092431316 [serving] replace O(n²) stream_buffer string concat with integer offset (#22606)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-14 14:48:44 -07:00
Liangsheng Yin
36891ab514 Rename _alive_streaming_session_count; use _is_streaming helper (#22755) 2026-04-14 13:26:03 -07:00
Liangsheng Yin
0cb7295698 Fix streaming session busy-check double-counting via active_pool_idxs (#22753) 2026-04-14 13:11:06 -07:00
mingyue300
b4616dcbf5 [BugFix] Fix EAGLE speculative decoding missing grammar-based finish … (#21723) 2026-04-14 12:43:50 -07:00
Mick
d2f479e544 [diffusion] chore: auto-enable best parallel setting if unspecified (#22763)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 00:02:05 +08:00
Bi Xue
070c6a2489 [sgl] perf optimization for eplb (#21232) 2026-04-14 22:52:17 +08:00
Mick
c5e95080d2 [diffusion] model: support Ltx 2.3 two stage ti2v (#22667) 2026-04-14 22:10:08 +08:00
lawtherWu
454228e071 hicache storage backend mooncake support ascend hixl (#20016) 2026-04-14 20:51:06 +08:00
Jia Guo
6da3aba6a5 perf: optimize PCG inductor path for FP8 models (#21734)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 17:51:27 +08:00
xutizhou
3cb3f7c018 fix: EPLB dispatch OOB when shared experts fusion enabled under DeepEP (#22525) 2026-04-14 02:33:27 -07:00
Jincong Chen
6760c790bd [bugfix] avoid attention padding tokens computation in pcg (#17706) 2026-04-14 16:08:23 +08:00
Michael
eab045b2b7 [AMD] Add MiniMax-M2.7 accuracy and performance nightly tests (#22722)
Co-authored-by: HaiShaw <hixiao@gmail.com>
2026-04-14 00:30:11 -07:00
xiaobochen-amd
d7ecab5113 [ROCm]fix(aiter): cast fp8 prefill output back to model dtype (#22626)
Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com>
2026-04-14 00:25:09 -07:00
Xiaoyu Zhang
f97c608caa [diffusion] quant: add FLUX.1-dev modelopt nvfp4 support (#22672) 2026-04-14 15:00:59 +08:00
Colin Z
b10f852118 GLM-5/5.1 MXFP4 Checkpoint Inference Compatibility Fix (#22543)
Co-authored-by: HAI <hixiao@gmail.com>
2026-04-13 23:56:48 -07:00
YAMY
657945c338 Replace all-reduce + dp_scatter with reduce_scatterv for DP attention (#22642) 2026-04-13 21:51:10 -07:00
ishandhanani
520ce526b9 Restore Qwen3 rope config fallback (#22739) 2026-04-13 21:47:37 -07:00
Xuwei
a9a2ae4a68 [Anthropic] Fix clock mismatch in received_time causing negative Prometheus metrics (#22247)
Signed-off-by: Xuwei Li <lixuwei.xy@gmail.com>
2026-04-13 21:22:00 -07:00
huangtingwei
e9d6b9eb2d [HiCache & HybridModel] mooncake backend support DSA & mamba model (#21259)
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
2026-04-13 18:47:36 -07:00
ishandhanani
cc449ac4e5 feat(metrics): expose raw KV cache pool token counts as prometheus gauges (#22726) 2026-04-13 18:30:36 -07:00
huangtingwei
945d73824f [HiSparse] Clarify decode token usage logs (#22331) 2026-04-13 18:03:25 -07:00
yuki-brook
1ec018f27a [Feature] Add SiMM as sglang HiCache Storage backend (#18016) 2026-04-13 17:12:37 -07:00
Liangsheng Yin
33a3ba256f Delete dead rematch path in SessionAwareCache.release_session (#22735) 2026-04-13 17:02:40 -07:00
Lianmin Zheng
9fb00ede15 Clean up TokenizerManager and req_time_stats: reduce overhead and simplify (#21646) 2026-04-13 16:47:32 -07:00
Jia Guo
a2b5111962 perf: skip KV cache in FA backend for embedding mode (#21971)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 16:27:52 -07:00
Lianmin Zheng
8f9553bccb [Misc] Migrate SGLANG_SET_CPU_AFFINITY to envs and refactor model config building (#22730) 2026-04-13 16:10:31 -07:00
mqhc2020
f4f9e68189 [AMD] Add MoE weights and scales padding (#21097)
Co-authored-by: HAI <hixiao@gmail.com>
2026-04-13 15:50:15 -07:00
Yilong Zhao
b1efce342c env: add knob to control SWA eviction interval (#22645) 2026-04-13 15:37:59 -07:00
Lianmin Zheng
f81b6e8f51 [Misc] Add @cache_once to is_arch_support_pdl in jit_kernel (#22724) 2026-04-13 14:42:49 -07:00
Baizhou Zhang
b441317aa4 Revert "Upgrade CI default CUDA version from 12.9 to 13.0" (#22727) 2026-04-13 14:39:24 -07:00
Lianmin Zheng
ba7bcca6b3 Use reshape instead of contiguous().view() in TRTLLMHAAttnBackend (#22517) 2026-04-13 14:29:12 -07:00
Kurt Shuster
ff13dfee45 [lora][moe] Virtual experts for LoRA MoE (#22122)
Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>
2026-04-13 21:19:30 +00:00
ishandhanani
6b2bf66cd9 fix[glm4.7 flash]: properly detect gfx95_quant_format (#22720) 2026-04-13 13:10:07 -07:00
Asish Kumar
39810762d2 fix: use describe mode for SGLang version detection (#22600)
Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
2026-04-13 09:45:45 -07:00
DarkSharpness
314d6ecf08 [Feature][JIT Kernel] Fused TP QK norm For Minimax (#20673)
Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2026-04-13 20:29:47 +08:00
Xiaole Guo
4df60434d7 [diffusion] model: support stable-diffusion-3-medium-diffusers (#19225)
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>
Co-authored-by: Kangrui Du <kangruidu@gmail.com>
Co-authored-by: Xiaole Guo <gxlvera@gmail.com>
2026-04-13 16:07:06 +08:00
Chandrakant Khandelwal
1e9eecfa36 [Intel GPU] Enable sgl-kernel-xpu fused_experts MoE kernel path for GPT-OSS bf16 models. (#22417) 2026-04-13 13:45:48 +08:00
Mick
d524f110ac [diffusion] refactor: streamline denoising stages (#22633) 2026-04-13 13:34:37 +08:00
Polisetty V R K Jyothendra Varma
7d2c11970c [Intel GPU] Upgrade pytorch xpu version to 2.11 (#21908)
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
Co-authored-by: Ma Mingfei <mingfei.ma@intel.com>
2026-04-13 13:16:24 +08:00