7855 Commits

Author SHA1 Message Date
Junrong Lin
6448de7e96 Reorganize topk logic to clean up code and expose logical experts (#16945) 2026-02-23 00:30:57 -08:00
Yuzhen Zhou
4f3dc1ef5b [ROCm] Use unreg path for custom all-reduce during CUDA graph capture (#19162) 2026-02-22 23:27:31 -08:00
Changyi Yang
c3c1532820 [diffusion] feat: detect Flux2 custom VAE path from component_paths (#19170) 2026-02-23 15:19:53 +08:00
Qiaolin Yu
42b1019881 Fix bench_one_batch_server by moving the print statements (#19175) 2026-02-22 22:06:25 -08:00
Baizhou Zhang
2472e47d73 Revert "Refactor graph input buffers (#18991)" (#19173) 2026-02-23 13:09:54 +08:00
Baizhou Zhang
fa80b9beba [CI] Skip some subtests for tool call parser (#19172) 2026-02-23 12:20:12 +08:00
Ratish P
0e670c36d6 fix(diffusion): enforce strict input_reference validation for T2V (#14825)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-02-23 12:04:06 +08:00
Baizhou Zhang
43f83525c0 Revert "[AMD] support two batch overlapping for mori ep #17953" (#19161) 2026-02-23 01:19:23 +08:00
Mick
45095bac70 [diffusion] refactor: rename quantized model path server arg (#19142) 2026-02-22 23:18:35 +08:00
Talor Abramovich
c6a99e43b9 Fix corrupted JSONL metrics file due to concurrent writes (#19011) 2026-02-22 20:33:27 +08:00
Mick
7d4860bc5e [diffusion] CI: relax perf check threshold (#19154) 2026-02-22 20:02:12 +08:00
Mick
87823722b3 [diffusion] chore: minor cleanups (#19123) 2026-02-22 19:07:25 +08:00
fzyzcjy
1a1c768d44 Support kwargs and megatron core tensor parsing in dumper (#19138) 2026-02-22 16:24:33 +08:00
Ziang Li
eddf193292 [DSv32] [GLM5] Improve Model Quality by Avoiding FP32 Precision Loss in weights_proj (#19041) 2026-02-22 16:20:51 +08:00
fzyzcjy
326b788ab4 Fix wrongly large dumped file and handle non intrusive hook reset in dumper (#19124) 2026-02-22 16:20:08 +08:00
fzyzcjy
c1f497e20e Enhance reset, states, http in dumper (#19095) 2026-02-22 16:17:41 +08:00
fzyzcjy
4091b720c5 Support multi colocated dumper, named exp cleanup, argparse config (#19094) 2026-02-22 16:16:15 +08:00
fzyzcjy
31f0c11405 Configure and call dumper in main SGLang logic (#19093) 2026-02-22 16:14:27 +08:00
fzyzcjy
cc63c99f11 Enhance hook mechanism in dumper (#19073) 2026-02-22 16:13:38 +08:00
fzyzcjy
fdf80b5031 Extract framework plugins in dumper (#19072) 2026-02-22 16:10:43 +08:00
fzyzcjy
e32b5364a2 Auto annotate context in dumper (#19071) 2026-02-22 16:08:48 +08:00
fzyzcjy
8bc0751376 Support enabling partial non intrusive dump in dumper (#19069) 2026-02-22 16:07:45 +08:00
fzyzcjy
0384c459a7 Support non-intrusive dumping in dumper (#19068) 2026-02-22 16:04:02 +08:00
fzyzcjy
5eccc3cff9 Refactor dumper and change on_forward_pass_start API (#19065) 2026-02-22 16:03:27 +08:00
Shangming Cai
eccee4c48e [PD] Change bootstrap_room metadata dtype from int64 to uint64 (#19141) 2026-02-22 14:20:16 +08:00
YAMY
5995bfec63 [Qwen3-Next] Enable fused_qkvzba_split_reshape_cat also for prefill (#18917) 2026-02-22 13:57:17 +08:00
Qiaolin Yu
8cf003c44b Fix spec v2+dp attention in nsa backend (#19134) 2026-02-22 13:46:15 +08:00
Bhavneek Singh
8612358ee9 [BUG] [DLLM] Missing max_running_requests value (#18740) 2026-02-22 13:38:56 +08:00
fy1214
ae62898ffb [diffusion] quant: adapt FP8 linear to sgld and support quant in flux (#17023)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-02-22 12:55:28 +08:00
YAMY
cef353f338 [Fix] Quick fix for int32 overflow in Mooncakes' send_kvcache_slice (#19076) 2026-02-22 12:00:33 +08:00
Liangsheng Yin
4653939cda Revert "[jit kernel] Support per_token_group_quant_8bit jit kernel" (#19131) 2026-02-22 07:54:24 +08:00
Liangsheng Yin
1f2da824dd [Benchmark] Remove re-exports from bench_serving.py (#19130) 2026-02-21 14:30:30 -08:00
Ratish P
f158869c2c [Refactor] Benchmark Phase 1: extract utils and datasets from bench_serving (#19077)
Co-authored-by: Xuchun Shang <107600043+xucsh@users.noreply.github.com>
2026-02-21 13:50:11 -08:00
Mohammad Miadh Angkad
7b0fb43c7a [FlashInfer] Switch FlashInfer allreduce fusion to unified API (#18341) 2026-02-22 00:07:16 +08:00
Bi Xue
bf36aa4c31 [sgl] view could hold the memory too long and introduced large memory (#19109) 2026-02-21 23:40:56 +08:00
Xinyuan Tong
677b66af80 fix KimiK2Detector regex patterns with re.DOTALL (#19120)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2026-02-21 22:13:08 +08:00
Xinyuan Tong
4a362a0e04 fix tool handling in OpenAIServingChat (#18996)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2026-02-21 22:07:09 +08:00
Xiaoyu Zhang
66497ab0aa [Diffusion] Restruct and clean Diffusion rotary embedding (#19064) 2026-02-21 21:41:47 +08:00
DarkSharpness
d8d0208c63 [Feature] rewrite rope kernel; remove flashinfer dependencies (#18844)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-21 21:32:40 +08:00
Mick
6503f94211 [diffusion] feat: support passing component path via server args (#19108) 2026-02-21 21:22:47 +08:00
Xinyuan Tong
cc451671b5 [FEAT] Add Anthropic compatible API endpoint (#18630)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2026-02-21 19:37:38 +08:00
Mick
b89ca65789 [diffusion] refactor: reduce redundancy and improve stage api (#19060) 2026-02-21 16:35:47 +08:00
danielafrimi
33c33a7de9 [Quantization] Support config.json quantization_config format, fix exclude_modules matching, and fix KV cache scale loading for Nemotron (#18546)
Signed-off-by: root <dafrimi@nvidia.com>
2026-02-21 16:14:29 +08:00
Nicolas Castet
51b3ed02ca Fix bug in symm mem pre-allocation default (#19082) 2026-02-21 11:51:28 +08:00
Vladislav Nosivskoy
afd91e8782 [DSv32] Fix MTP and CP compatability (#19062)
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
2026-02-21 11:10:34 +08:00
Lianmin Zheng
463baafe10 [Auto Sync] Update batch_invariant_ops.py (20260221) (#19098)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jiayi Yuan <34369239+jy-yuan@users.noreply.github.com>
2026-02-20 18:27:31 -08:00
Lianmin Zheng
2928dfb8fa [Auto Sync] Update bench_one_batch_server_internal.py (20260221) (#19097)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com>
2026-02-20 18:19:14 -08:00
Cheng Wan
84c67c8be0 Refactor graph input buffers (#18991) 2026-02-20 18:09:31 -08:00
Minglei Zhu
4bffd3a232 [GPT-OSS] support fp8 online quantization for gpt-oss bf16 (#18988)
merge it as all required CI passed
2026-02-20 14:16:57 -08:00
Qiaolin Yu
96bae2355e Add generated-shared-prefix dataset in bench_one_batch (#18986) 2026-02-20 13:33:10 -08:00