7855 Commits

Author SHA1 Message Date
Shu Wang
5638d40f3a [nvidia] Gemma4 nvfp4 fix (#22079) 2026-04-10 08:44:34 +08:00
Tarushii Goel
cebd9c2a1e [sgl] add ability to return logprobs in MultiLayerEagleWorkerV2 (#22241) 2026-04-09 16:20:55 -07:00
Mohammad Miadh Angkad
c3833ba929 Enable DFLASH support for additional model backends (#22358)
Co-authored-by: David Wang <21328423+dcw02@users.noreply.github.com>
2026-04-09 14:36:12 -07:00
Ethan (Yusheng) Su
28ef6de091 [Lora] Lora quat info re-factor and support deepseekv3 mla lora (#22323) 2026-04-09 14:19:58 -07:00
Baizhou Zhang
60acdc31f2 [Fix] Fix several bugs on DSA models (#22430) 2026-04-09 12:46:23 -07:00
Baizhou Zhang
606aa11ea8 [DSA] Enable all reduce fusion for DSA models (#22390) 2026-04-09 12:42:44 -07:00
Lawrence Wu
8eb235ab51 fix: do not strip whitespace from GLM tool call values (#20543)
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
2026-04-09 11:14:15 -07:00
Lishan H
8b991d98a1 [feature] asr: add chunk-based streaming ASR for Qwen3-ASR (#22089) 2026-04-10 01:49:03 +08:00
billishyahao
1df9f4e2f6 [AMD] Add prealloc token env for mori-ep (#22329) 2026-04-09 09:34:35 -07:00
Jonah Bernard
8216b921a1 Add MLX profiling to bench_one_batch.py (#22159) 2026-04-09 20:45:21 +08:00
Liangsheng Yin
9fed58805f [Doc] Clarify SWA HybridSWAPoolConfigurator comments on all-SWA vs hybrid semantics (#22443) 2026-04-09 03:02:16 -07:00
YMbmzy
8a67fb20ea [Speculative] Support penalty for spec v2 overlap scheduling (#22049) 2026-04-09 01:59:04 -07:00
Thomas Wang
628df31d08 [AMD] Use aiter CK layernorm2d for LayerNorm to reduce NSA indexer kernel launches (#22424) 2026-04-09 01:55:29 -07:00
xutizhou
57ffc55fb6 feat: [1/2] [DeepEP] Fuse shared expert into MoE dispatch under EP (#20089)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: AichenF <aichenf@nvidia.com>
2026-04-09 01:48:28 -07:00
Mick
9709192ce9 [diffusion] feat: support FLUX.2-small-decoder (#22414) 2026-04-09 15:53:14 +08:00
Liangsheng Yin
de441ac6bb [core] Introduce MemoryPoolConfigurator class hierarchy (#22389) 2026-04-09 15:29:19 +08:00
Evgueni Petrov
b9c316917b fix AttributeError: 'LazyValue' object has no attribute 'keys' in eplb_manager.py for qwen3 moe (#21822) 2026-04-09 00:13:29 -07:00
Nicolas Castet
e379befbac Add symmetric debug mode to print stack trace of comm ops with unregistered tensors (#18569) 2026-04-08 22:34:58 -07:00
Bingxu Chen
6b96f8341d [AMD] Fix multimodal diffusion test crash on ROCm by falling back to SDPA (#22335) 2026-04-08 22:32:49 -07:00
Mick
355fcbcc17 [diffusion] fix: fix cache dit refresh none mask (#22374) 2026-04-09 11:58:24 +08:00
jsheng_Linkedin
6838a23226 [Feature] Add token embedding overrides for sparse embedding replacement (#20960)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 20:51:36 -07:00
Kurkur
a69be2e866 [Feature] Support eagle3 for qwen3-vl (#22230) 2026-04-09 11:45:36 +08:00
Lianmin Zheng
ddc8ef1038 Lazy import flash_attention_v4 to avoid loading flash_attn.cute at startup (#22306) 2026-04-08 20:40:25 -07:00
Khoa Pham
f127d67823 [Spec][Ngram] Misc enhance support for multiple SAMs (#22294) 2026-04-08 19:56:23 -07:00
Kangrui Du
1b7c33a5b7 [diffusion] rl: revamp rollout Log-Prob support with SDE/CPS for RL post-training (#21204)
Co-authored-by: MikukuOvO <mikukuovo@gmail.com>
2026-04-09 09:00:00 +08:00
Liangsheng Yin
1e3f6ebea6 [core] Extract pool sizing logic to pool_configurator.py (#22384) 2026-04-08 16:13:21 -07:00
Baizhou Zhang
4e5b8cb041 Fix get_version_tag.py to handle dot-separated post versions (#22385)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 15:18:22 -07:00
sglang-bot
df3275bd6c chore: bump flashinfer version to 0.6.7.post3 (#22382)
Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com>
2026-04-08 14:49:45 -07:00
Yufeng He
c89afaea7c Fix hybrid_linear_attn_backend crash with ngram speculation (#20739)
Co-authored-by: kpham-sgl <khoa.pham@radixark.ai>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 12:52:07 -07:00
YAMY
c26b8b4a4b [GDN] Remove FlashInfer GDN decode + no_buffer guard and default to FlashInfer on SM100+ (#21861) 2026-04-08 11:59:15 -07:00
Kurt Shuster
db30a63a13 [sgl-kernel] support > 1024 experts in moe_align_block_size kernel (#21610) 2026-04-08 11:45:13 -07:00
Mick
4ac6fa0d87 [diffusion] fix: fix loading multiple ckpts with different precision for a same module (#22360) 2026-04-09 02:44:19 +08:00
Yihao Wang
a5ed507a16 [refactor] [asr] add transcription adapter for extensible ASR models support (#22181) 2026-04-09 01:19:37 +08:00
Yihao Wang
ae8da14ea3 [fix] [whisper] ensure inputs are moved to the correct device before processing. (#22293) 2026-04-08 23:45:42 +08:00
Xiaoyu Zhang
b5b2dbe05f [Diffusion] Add diffusion NVFP4 scaled-mm correctness test (#22127)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-04-08 22:07:24 +08:00
zhaozx-cn
33c9cc8994 [NPU] fix qwen3.5 video processor (#22266) 2026-04-08 21:13:29 +08:00
Fergus
413913763f fix: wrap _import_static_state in inference_mode to fix resume on Blackwell (#21035) 2026-04-08 02:03:39 -07:00
Vladislav Nosivskoy
79c82c5c42 [HiCache] Fix write_backup return type when parent not backed up (#22185)
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Co-authored-by: hzh0425 <hzh0425@apache.org>
2026-04-08 16:42:57 +08:00
Sundara Raman Ramachandran
712c8c5051 [Score API] Add SequenceClassification Model support (#22118) 2026-04-08 01:30:58 -07:00
HuangJi
c3c13dd5e3 [diffusion] fix: make warmup image initialization rank-safe (#21817) 2026-04-08 15:51:09 +08:00
Bingxu Chen
de0cfed159 [AMD] Fix DLPack Error in Aiter flydsl GEMM by Detaching MoE Gate Weight (#22262)
Co-authored-by: bingxche <binxche@amd.com>
2026-04-07 23:42:10 -07:00
Артем Савкин
cd373667cd [Bugfix] [NPU] Qwen3.5 with quantization fix (#21692) 2026-04-08 09:15:48 +03:00
Thomas Wang
729b74d8dd [AMD] Fix GLM-5 fp8 KV quant path dispatch on MI300 (#22314) 2026-04-07 21:16:02 -07:00
yuefeng Wu
4e4b4ac153 [NPU] enable index Cache for npu (#21502) 2026-04-08 11:45:17 +08:00
Alex Nails
493ec91cbe [CI] Fix stage-b-test-1-gpu-large (0) timeout by reordering LoRA tests and using tokenizer from cache (#22292)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 20:00:44 -07:00
Liangsheng Yin
1c5c6dad5e [tiny] Fix TOCTOU race in pause-aware weight update locking (#22304)
Co-authored-by: maocheng23 <maocheng@berkeley.edu>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 18:54:28 -07:00
Mick
eca62ab8f4 UX: clean loggings (#22174) 2026-04-08 09:46:38 +08:00
maocheng23
6c2a759a04 [fix] Fix writer lock deadlock in update_weights_from_ipc during pause_generation (#22290)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 18:32:56 -07:00
Trevor Morris
7546d04c81 [NVIDIA] Enable FP4 flashinfer trtllm routed moe (#21240) 2026-04-07 16:16:29 -07:00
Liangsheng Yin
0e2a0260a1 Add fast-fail to multimodal-gen CI (#22284) 2026-04-07 15:56:12 -07:00