6820 Commits

Author SHA1 Message Date
MARATRIX
069d4c577b Fix Kimi K2.5 PP layer range exposure for PD disaggregation (#19959)
Signed-off-by: yafeng.li <yafeng.li@mthreads.com>
2026-03-06 16:14:02 -08:00
Liangsheng Yin
ddcecdea49 [Core] Unify max_num_reqs dp_size division for pool sizing (#20063)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-06 16:12:59 -08:00
Kangyan-Zhou
7a12255b6e fix: set first_token_time before computing decode_throughput for single-batch completions (#19984)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 16:11:41 -08:00
Aurick Qiao
5c8e28698c Add cleanup for _ATTN_TP in parallel_state.py (#19978) 2026-03-06 15:43:31 -08:00
Shu Wang
61de303f0a Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe (#19189) 2026-03-06 15:15:04 -08:00
Kangyan-Zhou
e89069ee64 Fallback to torch.cuda.mem_get_info() when nvidia-smi is unavailable (#18957)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 15:00:08 -08:00
Liangsheng Yin
604db4471d [Core] Clarify memory variable naming in model runner (#20060) 2026-03-06 14:00:46 -08:00
Liangsheng Yin
7a6cf0e9ba [Core] Extract _calculate_mamba_ratio and _init_pools from init_memory_pool (#20058) 2026-03-06 13:37:22 -08:00
Mohammad Miadh Angkad
759700c808 Fix SM120 triton_kernels MXFP4 block_k for GPT-OSS (#20040) 2026-03-06 10:53:08 -08:00
R0CKSTAR
de1a0afcbc [MUSA][10/N] Add GGUF support (#18357)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2026-03-06 10:50:35 -08:00
JohnHerry
e8f2b80340 [diffusion] improve: improve code readability of DenoisingStage (#20003) 2026-03-06 23:23:44 +08:00
xingsy97
54634b9a40 [Kernel] Dispatch exp/sin/cos through dtype_trait (#19798) 2026-03-06 22:57:52 +08:00
Johnsonms
2d266c73ea Migrate renorm kernels from sgl-kernel to FlashInfer JIT (#18854) 2026-03-06 22:53:28 +08:00
Xiaoyu Zhang
6d22c9f369 [Diffusion] Move hf kernels diffusion cuda kernels skills to SGLD (#20001)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-06 22:16:06 +08:00
Yuan Luo
f7de9375ac [GDN][Qwen3-Next][Qwen3.5] Fuse fused_gdn_gating and fused_recurrent_gated_delta_rule_update in verify_target (#19775) 2026-03-06 21:42:44 +08:00
Prozac614
e3b581ce6b [diffusion] fix: remove num_frames in wan2_1_t2v_1_3b_lora_1gpu test (#20009)
Co-authored-by: daiweitao <dwti614707404@163.com>
2026-03-06 21:36:43 +08:00
Kangyan-Zhou
25e678d933 [diffusion] endpoint: add /server_info and /model_info endpoints for gateway discovery (#20020)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 21:36:13 +08:00
inkcherry
84aaa69795 [AMD] Use bfloat16 for correction_bias in AITER FP8 path to avoid runtime dtype conversion for dsv3 (#19843) 2026-03-06 00:57:12 -08:00
Clint
27053aa5ed Fix MLA decode path returning unwritten (padded) rows (#19902) 2026-03-06 00:54:29 -08:00
xdtbynd
0252ca8255 [Bugfix] Fix the bug blocking the startup of Llama-3.2-11b-Vision-Instruct (#19638)
Co-authored-by: sglang-npu-bot <sglangnpu@163.com>
2026-03-06 16:21:50 +08:00
Zheng Wengang
da27d9bff6 [Bug-Fix][EPD]: skip log waiting-image-req for zmq_to_tokenzer/mooncake (#19555) 2026-03-06 14:39:22 +08:00
Mook
be9a9e4819 refactor(multimodal/test): centralize model names and shared utilities in test_utils (#19354)
Co-authored-by: Ratish P <114130421+Ratish1@users.noreply.github.com>
2026-03-05 20:09:42 -08:00
Baizhou Zhang
51e5dc845a Revert "[Kernel Slimming] Migrate NVFP4 kernels to JIT" (#20005) 2026-03-05 19:40:00 -08:00
sushil Dubey
6e5a2de354 [diffusion] fix: fix reading multiple prompts from prompt file (#19075)
Signed-off-by: Sushil Dubey <sushil.dubey@intel.com>
2026-03-06 11:23:31 +08:00
Simo Lin
9502369488 fix(grpc): add server-side keepalive options to prevent GOAWAY (#19986)
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
2026-03-05 18:56:35 -08:00
liupeng374
5471e4a492 [NPU][Feature] eliminate dsv3 redundant rotary embed calculation (#19842) 2026-03-06 09:02:14 +08:00
chenxu214
b912d7ae19 [OPT]Skip the first delayer to maximize the BS of the decoding. (#19836) 2026-03-06 08:53:19 +08:00
shadowxz109
261be85ecc Support mrope_position_delta cache 2026-03-06 08:50:53 +08:00
Xinyuan Tong
9ebffef1ef [FIX] NSA backend page_table overflow in speculative decoding target_verify (#19016) 2026-03-05 16:04:58 -08:00
Ajay Anubolu
13af7cbb02 fix: use consistent time denominator for throughput metrics in bench_one_batch_server (#19223) 2026-03-05 15:58:17 -08:00
Chang Su
dd2bbe6d62 fix(grpc): use context.abort() with proper status codes instead of in-band errors (#19972)
Signed-off-by: Chang Su <chang.s.su@oracle.com>
2026-03-05 14:53:18 -08:00
Qiaolin Yu
46dced64ea Adjust padding size to improve triton_kernels moe performance (#19174) 2026-03-05 14:50:40 -08:00
kpham-sgl
346a4131cf [Spec] Refactor NaN/OOB checks to async maybe_detect_* with env-var control (#19899)
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
2026-03-05 13:51:05 -08:00
Xinyu Zhang
b3cfad0a80 Add Ray actor support for scheduler process management (DP=1) (#17684)
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-03-05 13:21:23 -08:00
sglang-bot
ebb66cc1de [misc] Priority scheduling metrics cleanup (#19927)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 12:42:42 -08:00
danielafrimi
ff6048fb9c rename nemotron reasoning parser (#19865)
Signed-off-by: dafrimi <dafrimi@nvidia.com>
2026-03-05 11:27:07 -08:00
Mohammad Miadh Angkad
41fd53fe37 Fix profile_activities parameter name in bench_one_batch_server_internal.py (#19954) 2026-03-05 10:34:06 -08:00
akhilg-nv
73d272bddb Revised fix for HybridAttnBackend forward for linear attn (#19369) 2026-03-06 00:05:35 +08:00
Zheng Wengang
0de0d74195 [EPD][Feat]support adaptive forward (#18118) 2026-03-05 21:12:30 +08:00
StonyPort
806d41ab65 [quant] fix fp32 downcasting (#19844)
Co-authored-by: qiuxuan.lzw <qiuxuan.lzw@alibaba-inc.com>
2026-03-05 17:54:59 +08:00
Rain Jiang
472eef4071 fa4 cleanup (#19727) 2026-03-05 17:54:25 +08:00
Chi McIsaac
c36de62bfc [diffusion] fix images/edit with 2 images (#17520)
Signed-off-by: Chi McIsaac <chixie.mcisaac@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-03-05 16:56:39 +08:00
xingsy97
dbc896f204 [Test] Enhance JIT kvcache store kernel test coverage (#19630) 2026-03-05 16:17:15 +08:00
Tiwei Bie
727face6c2 [DLLM] Add initial radix cache support (#18724) 2026-03-04 23:24:09 -08:00
Kalyan Kumar
c1df359b44 Add XPU profiler activity support in benchmark code (#12981)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-04 23:22:56 -08:00
Mohammad Miadh Angkad
2bdd89a6cd [Kernel Slimming] Migrate NVFP4 kernels to JIT (#19437) 2026-03-05 15:22:28 +08:00
Yilong Zhao
1bbfed0539 [misc] add env for http keep alive timeout (#19847) 2026-03-04 22:00:51 -08:00
Chenxi Li
86c5617787 [BUG]: fix prevent illegal memory access in Mamba SSM tracking during EAGLE speculative verification (#19415)
Co-authored-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
2026-03-04 21:13:21 -08:00
Baizhou Zhang
10c65df48a [Bug] Fix lora tp bug on H200 (#19769) 2026-03-04 20:11:02 -08:00
Xinyi Song
0e6a64712a [bugfix] Fix PPMissingLayer AttributeError when Using PP (#19804) 2026-03-04 19:48:15 -08:00