Liangsheng Yin
f9792166c3
trim_overshoot: cap swa_evicted_seqlen + unit test ( #22900 )
2026-04-15 15:05:35 -07:00
Xinyu Zhang
13a2cd748d
[Ray] Add data parallel (DP) and DP attention support to RayEngine ( #21887 )
...
Co-authored-by: xyuzh <xyuzh@users.noreply.github.com>
2026-04-15 15:00:48 -07:00
Sundara Raman Ramachandran
4927975427
[Score API] Add return_pooled_hidden_states to Scoring API for SequenceClassification / RewardModel ( #22427 )
2026-04-15 14:58:56 -07:00
Lee Nau
4e480d5785
Harden FlashInfer FP4 imports in standard dispatcher ( #21776 )
2026-04-15 14:54:49 -07:00
Liangsheng Yin
efc267ca29
streaming session: trim spec v2 overshoot in cache_finished_req ( #22897 )
2026-04-15 14:15:46 -07:00
Lianmin Zheng
43925d179d
[Speculative] Fix Eagle3/DFLASH aux hidden state capture during CUDA graph init ( #22836 )
2026-04-15 14:04:54 -07:00
Kurt Shuster
32d9fe5a32
[lora] Speedup triton backend sgemm calls with better grid ( #22386 )
2026-04-15 13:47:07 -07:00
Jimmy Shong
28e915b474
[Bugfix] Preserve auto-detected quant_config for GLM NextN draft model ( #22823 )
2026-04-15 13:25:36 -07:00
Yuhao Yang
8686f42acb
[VLM] Enable per-image ViT cache and avoid TP CUDA context creation for Kimi-K2.5 ( #22858 )
2026-04-16 01:14:24 +08:00
huangtingwei
7d7fdc1309
[HiCache]Fix CP support for hybrid model ( #22782 )
...
Co-authored-by: hzh0425 <hzh0425@apache.org>
2026-04-15 23:50:29 +08:00
Xiaoyu Zhang
695ab705cb
[diffusion] quant: update modelopt quantization docs and CI coverage ( #22772 )
2026-04-15 21:30:28 +08:00
Mick
80718492dd
[diffusion] CI: reset thresholds ( #22854 )
2026-04-15 21:11:00 +08:00
Zhangheng
0a5c9728a1
[HiSparse][BugFix]: Fix the memory leak issue during health checks. ( #22882 )
2026-04-15 19:49:54 +08:00
Liangsheng Yin
ce31934ca8
Streaming session: fix retract tail leak via _free_tail ( #22862 )
2026-04-15 01:44:27 -07:00
huangtingwei
3511c2deb4
[HiCache] Fix memory host free logic when share_indices_with_anchor enabled ( #22767 )
...
Co-authored-by: hzh0425 <hzh0425@apache.org>
2026-04-15 16:31:18 +08:00
Liangsheng Yin
aa78564e1a
Refactor streaming session abort handling ( #22790 )
2026-04-15 00:13:05 -07:00
Hubert Lu
b2af34be54
[AMD] Optimize _append_shared_to_topk_output by a single fused Triton kernel for Qwen3.5 ( #22844 )
...
Co-authored-by: HaiShaw <hixiao@gmail.com>
2026-04-14 23:50:32 -07:00
Mick
e95c2e73bd
[diffusion] CI: refactor diffusion ci and reduce redundancy ( #22810 )
2026-04-15 10:12:29 +08:00
Kangrui Du
47ac830c07
[diffusion] rl: support standalone rollout api, denoising environment backpass and sp-aligned log-prob for T2I post-training ( #22604 )
...
Co-authored-by: MikukuOvO <mikukuovo@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 10:10:38 +08:00
Lianmin Zheng
adb310b976
Cleanup server_args.py and minor code tidying ( #22820 )
2026-04-14 18:52:41 -07:00
Chen, Zhentao
ea05ea5abe
[AMD] Enable share expert fusion with router experts for Qwen3.5 BF16 & FP8 ( #20736 )
...
Co-authored-by: Chen, Todd <zhenchen@amd.com>
Co-authored-by: jacky.cheng <yichiche@amd.com>
2026-04-14 18:52:36 -07:00
Piotr Mazurek
46c8a597ef
[VLM] fix LFM2-VL offline inference and GPU JPEG decode ( #22448 )
2026-04-15 09:13:25 +08:00
Alex Nails
8092431316
[serving] replace O(n²) stream_buffer string concat with integer offset ( #22606 )
...
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-14 14:48:44 -07:00
Liangsheng Yin
36891ab514
Rename _alive_streaming_session_count; use _is_streaming helper ( #22755 )
2026-04-14 13:26:03 -07:00
Liangsheng Yin
0cb7295698
Fix streaming session busy-check double-counting via active_pool_idxs ( #22753 )
2026-04-14 13:11:06 -07:00
mingyue300
b4616dcbf5
[BugFix] Fix EAGLE speculative decoding missing grammar-based finish … ( #21723 )
2026-04-14 12:43:50 -07:00
Mick
d2f479e544
[diffusion] chore: auto-enable best parallel setting if unspecified ( #22763 )
...
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 00:02:05 +08:00
Bi Xue
070c6a2489
[sgl] perf optimization for eplb ( #21232 )
2026-04-14 22:52:17 +08:00
Mick
c5e95080d2
[diffusion] model: support Ltx 2.3 two stage ti2v ( #22667 )
2026-04-14 22:10:08 +08:00
lawtherWu
454228e071
hicache storage backend mooncake support ascend hixl ( #20016 )
2026-04-14 20:51:06 +08:00
Jia Guo
6da3aba6a5
perf: optimize PCG inductor path for FP8 models ( #21734 )
...
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 17:51:27 +08:00
xutizhou
3cb3f7c018
fix: EPLB dispatch OOB when shared experts fusion enabled under DeepEP ( #22525 )
2026-04-14 02:33:27 -07:00
Jincong Chen
6760c790bd
[bugfix] avoid attention padding tokens computation in pcg ( #17706 )
2026-04-14 16:08:23 +08:00
Michael
eab045b2b7
[AMD] Add MiniMax-M2.7 accuracy and performance nightly tests ( #22722 )
...
Co-authored-by: HaiShaw <hixiao@gmail.com>
2026-04-14 00:30:11 -07:00
xiaobochen-amd
d7ecab5113
[ROCm]fix(aiter): cast fp8 prefill output back to model dtype ( #22626 )
...
Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com>
2026-04-14 00:25:09 -07:00
Xiaoyu Zhang
f97c608caa
[diffusion] quant: add FLUX.1-dev modelopt nvfp4 support ( #22672 )
2026-04-14 15:00:59 +08:00
Colin Z
b10f852118
GLM-5/5.1 MXFP4 Checkpoint Inference Compatibility Fix ( #22543 )
...
Co-authored-by: HAI <hixiao@gmail.com>
2026-04-13 23:56:48 -07:00
YAMY
657945c338
Replace all-reduce + dp_scatter with reduce_scatterv for DP attention ( #22642 )
2026-04-13 21:51:10 -07:00
ishandhanani
520ce526b9
Restore Qwen3 rope config fallback ( #22739 )
2026-04-13 21:47:37 -07:00
Xuwei
a9a2ae4a68
[Anthropic] Fix clock mismatch in received_time causing negative Prometheus metrics ( #22247 )
...
Signed-off-by: Xuwei Li <lixuwei.xy@gmail.com>
2026-04-13 21:22:00 -07:00
huangtingwei
e9d6b9eb2d
[HiCache & HybridModel] mooncake backend support DSA & mamba model ( #21259 )
...
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
2026-04-13 18:47:36 -07:00
ishandhanani
cc449ac4e5
feat(metrics): expose raw KV cache pool token counts as prometheus gauges ( #22726 )
2026-04-13 18:30:36 -07:00
huangtingwei
945d73824f
[HiSparse] Clarify decode token usage logs ( #22331 )
2026-04-13 18:03:25 -07:00
yuki-brook
1ec018f27a
[Feature] Add SiMM as sglang HiCache Storage backend ( #18016 )
2026-04-13 17:12:37 -07:00
Liangsheng Yin
33a3ba256f
Delete dead rematch path in SessionAwareCache.release_session ( #22735 )
2026-04-13 17:02:40 -07:00
Lianmin Zheng
9fb00ede15
Clean up TokenizerManager and req_time_stats: reduce overhead and simplify ( #21646 )
2026-04-13 16:47:32 -07:00
Jia Guo
a2b5111962
perf: skip KV cache in FA backend for embedding mode ( #21971 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 16:27:52 -07:00
Lianmin Zheng
8f9553bccb
[Misc] Migrate SGLANG_SET_CPU_AFFINITY to envs and refactor model config building ( #22730 )
2026-04-13 16:10:31 -07:00
mqhc2020
f4f9e68189
[AMD] Add MoE weights and scales padding ( #21097 )
...
Co-authored-by: HAI <hixiao@gmail.com>
2026-04-13 15:50:15 -07:00
Yilong Zhao
b1efce342c
env: add knob to control SWA eviction interval ( #22645 )
2026-04-13 15:37:59 -07:00