Commit Graph

7855 Commits

Author SHA1 Message Date
Thomas Wang
671fe73961 Reduce unnecessary kernels and copies in the NSA indexer (#22232) 2026-04-07 15:37:08 -07:00
David Wang
f08726fd56 [Feature] Add DFLASH speculative decoding support (#22077)
Co-authored-by: Jian Chen <141193260+jianc99@users.noreply.github.com>
Co-authored-by: Zhijian Liu <5782437+zhijian-liu@users.noreply.github.com>
Co-authored-by: Richard Gong <8001209+gongy@users.noreply.github.com>
Co-authored-by: David Wang <21328423+dcw02@users.noreply.github.com>
Co-authored-by: yilian49 <43861414+yilian49@users.noreply.github.com>
Co-authored-by: xm:D <38322020+xiaomin-d@users.noreply.github.com>
2026-04-07 14:48:51 -07:00
Liangsheng Yin
cc35714b03 [tiny] migrate /get_server_info; print accept length in accuracy tests (#22282) 2026-04-07 13:08:35 -07:00
Rain Jiang
1a8eb890f6 Kernels community fa3 (#20796) 2026-04-07 12:48:44 -07:00
huangtingwei
0c204fbd57 [HiSparse] Optimize the scheduling of decode backup. (#21932)
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2026-04-07 10:34:58 -07:00
khalilzhk
6131fb5882 [NPU] enable mla prepare fused kernel only when being mla attn (#22024) 2026-04-08 00:49:16 +08:00
Ke Bao
be42fbbbd7 Support HTTP2 server (#21700) 2026-04-08 00:42:52 +08:00
shuwenn
ec5742f4ab fix: Auto-correct page_size for Mamba no_buffer radix cache mode (#20538) 2026-04-08 00:19:31 +08:00
Henson-Zh-Ali
727a182067 [Mamba] eliminate D2H if tracking mamba states (#20522)
Co-authored-by: hzh0425 <hzh0425@apache.org>
2026-04-08 00:17:26 +08:00
YAMY
5ae00ecd48 [Disagg][NIXL] Support Mamba state slice transfer for heterogeneous TP (Step 2/2 for Qwen3.5) (#22240) 2026-04-07 23:47:31 +08:00
Mick
e7bc23cdab [diffusion] CI: fix consistency check (#22251) 2026-04-07 23:43:18 +08:00
Yujun Dong
233f3e31bf fix(pcg,mm): fix zeroing of input_embeds when replay PCG (#22229) 2026-04-07 20:33:17 +08:00
Xingyu Liu
98f38b14df Add registration API for external linear attention backend (#21983)
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
2026-04-07 02:47:40 -07:00
Nicolas Castet
490fa9fa44 [Perf] Restore torch.compile fusion for topk postprocessing (#21771) 2026-04-07 01:38:38 -07:00
YAMY
3148742ddb [Disagg][NIXL] Fix heterogeneous TP KV transfer for non-MLA models (same logic with mooncake, Step 1/2 for Qwen3.5 support) (#22145) 2026-04-07 14:52:02 +08:00
amote-i
3f7dfba419 fix qwen2_5_math_rm_72b (#21295) 2026-04-07 14:36:57 +08:00
Aditya Sharma
f6e85676b5 model: support qwen3-asr (#22073)
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2026-04-07 13:27:05 +08:00
Chang Min Bark
a757c1e3fb [Apple Silicon] [MLX] Add mlx and mlx-lm dependencies (#22162)
Co-authored-by: R0CKSTAR <yeahdongcn@gmail.com>
2026-04-07 11:36:43 +08:00
Xinyuan Tong
2813cb6d9a [New Model] Gemma 4 (#21952)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Pengyu Chen <pychen96@gmail.com>
Co-authored-by: kpham-sgl <khoa.pham@radixark.ai>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Andy Luo <andy.luo@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: adarshxs <adarsh.shirawalmath@gmail.com>
2026-04-06 20:24:44 -07:00
jianzhao-xu
73fc87a74f fix(grok): adapt huihui-ai/grok-2 (#21522)
Co-authored-by: Jianzhao Xu <xujianchao@huawei.com>
2026-04-07 10:04:41 +08:00
Lianmin Zheng
494bb86169 Cache sub-objects in __getitem__ to ensure identity stability (#22184)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 18:53:38 -07:00
Prozac614
ef2d4013d7 [diffusion] CI: add consistency test (#15236)
Co-authored-by: daiweitao <dwti614707404@163.com>
2026-04-07 09:50:23 +08:00
Liangsheng Yin
e4b1366a46 [Spec][Ngram] Support multiple SAMs with dynamic HTTP API (#22203) 2026-04-06 18:49:22 -07:00
Liangsheng Yin
49cb7d546e Move hash utils out of hicache_storage to break CUDA import chain (#22214) 2026-04-06 18:16:40 -07:00
AichenF
5e2b0f860c [diffusion] perf: replace Conv3d with reshape + F.linear in PatchEmbed (#21014) 2026-04-07 09:12:59 +08:00
shadowxz109
ae38b24cc3 [NPU] Support dp-attention for MiniMax2.5 (#20919) 2026-04-07 08:55:37 +08:00
Trevor Morris
5cc246e095 Fix extra calls to get_numa_node_if_available to clean up logs (#21781) 2026-04-06 16:18:40 -07:00
Lianmin Zheng
a80961333b Clean up req_time_stats: reduce overhead and simplify (#22186) 2026-04-06 14:20:51 -07:00
Qiaolin Yu
93f38fe410 tiny fix chain-style multi layer eagle comments (#22206) 2026-04-06 13:49:03 -07:00
Tarushii Goel
8f337682bd [sgl] potential chained spec v2 fixes (#22041)
Co-authored-by: Mook <Godmook@users.noreply.github.com>
Co-authored-by: yudian0504 <yudian0504@users.noreply.github.com>
2026-04-06 13:38:04 -07:00
Ratish P
7f2fcc0b08 [VLM]: allow Qwen3.5 models for encoder disaggregation (#21849) 2026-04-07 02:07:24 +08:00
Aurick Qiao
3178f3959f Align incremental streaming logprobs with streamed output tokens (#21583) 2026-04-06 00:30:02 -07:00
Khoa Pham
12272b6791 [Spec][Ngram] 6/N: Load an external corpus and construct a Suffix Automaton (#21425) 2026-04-06 00:11:14 -07:00
Liangsheng Yin
6de2ff2a80 [Spec][Ngram] Followup fixes for MatchState incremental advance (#22180) 2026-04-05 23:04:28 -07:00
YAMY
dc125afffb Add staging buffer CI test and documentation for heterogeneous TP (#21921)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2026-04-06 14:00:20 +08:00
Khoa Pham
b2008bf9e0 [Spec][Ngram] 5/N: Store and advance anchor match state across decode steps (#21243) 2026-04-05 22:21:05 -07:00
Mick
82c41a2d9e [diffusion] model: support LTX2.3 (#22111) 2026-04-06 12:26:30 +08:00
Qiaolin Yu
f407461ec8 Tiny fix trtllm_fp8_per_tensor_scale_moe_wrapper router_logits dtype (#22006) 2026-04-05 21:11:45 -07:00
Lianmin Zheng
e835601fb7 Cache gfx95 quant format detection in DeepseekV2DecoderLayer (#22143) 2026-04-05 20:20:54 -07:00
Bi Xue
52801ff20c [sgl] two potential spec_v2 bug fixes (#21589)
Co-authored-by: yilian49 <yilian49@users.noreply.github.com>
2026-04-05 19:41:43 -07:00
Prozac614
2f00e42555 [diffusion] CI: apply diffusers backend in lora case (#22157)
Co-authored-by: daiweitao <dwti614707404@163.com>
2026-04-06 10:14:35 +08:00
Zhiqiang Xie
41c7c97ff3 fix hisparse LRU policy (#22170)
Co-authored-by: huangtingwei9988 <huangtingwei9988@users.noreply.github.com>
Co-authored-by: hzh0425 <hzh0425@users.noreply.github.com>
2026-04-05 18:47:58 -07:00
Kangyan-Zhou
93109cc89b [Fix] Fix setuptools-scm version resolution for rc tags (#22165)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-04-05 16:55:32 -07:00
Zhiqiang Xie
30ba1f78b0 Hisparse Minor Fix (#22131)
Co-authored-by: huangtingwei9988 <141888744+huangtingwei9988@users.noreply.github.com>
Co-authored-by: hzh0425 <58988019+hzh0425@users.noreply.github.com>
2026-04-05 16:15:47 -07:00
Kangyan-Zhou
5dd2c243eb fix: TRT-LLM MHA CUDA illegal address with EAGLE v2 + DP attention (#21649)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-04-05 09:41:14 -07:00
Baizhou Zhang
c5fa364b80 [Hotfix] Fix router gemm on sm103 (#22134) 2026-04-05 09:33:14 -07:00
Zhangheng
51b276de74 [BugFix][RadixTree]: Fix backup invariant violation in Hi-MambaRadixTree (#22062)
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: linjianyu77@foxmail.com
2026-04-05 23:19:50 +08:00
Shangming Cai
dccb11881f [PD] Fix staging warmup for GQA prefill decode different tp (#22153) 2026-04-05 23:13:06 +08:00
Liangsheng Yin
df9c831ab8 Unify think_end_id to model_config as single source of truth (#22148) 2026-04-05 03:35:38 -07:00
Liangsheng Yin
aeff9fb7c1 Add dump_metric to MMMU, lm-eval, and NeMo Skills eval paths (#22147) 2026-04-05 03:23:52 -07:00