6437 Commits

Author SHA1 Message Date
Xinyuan Tong
3c34d2c3eb [FIX] kimi_k2 reasoning parser (#17901)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2026-01-28 19:47:09 -08:00
Joe Redmond
0ff0d181ca feat: add custom request header logging (#17786) 2026-01-28 19:33:08 -08:00
kk
f1384f5293 Integration mori backend for EP a2a data communication (#17012)
Co-authored-by: Duyi-Wang <duyi.wang@amd.com>
Co-authored-by: billishyahao <bill.he@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
2026-01-28 19:07:34 -08:00
Jerry Ji
673dc09d9b [Fix][trtllm-mha] Canonicalize the strides when num_head = 1 (#17732) 2026-01-29 10:11:18 +08:00
Qi Yuhang
0368ddf9ea [JIT Kernel]Support fused_add_rmsnorm in JIT Kernel (#17677) 2026-01-29 09:29:59 +08:00
Zhang Yiyang (SII)
09a9147f59 [diffusion] model: support MOVA (#17704)
Co-authored-by: gaoyang07 <Gary1546308416AL@gmail.com>
Co-authored-by: cms42 <c@cms42.top>
Co-authored-by: cms42 <44895820+cms42@users.noreply.github.com>
Co-authored-by: Ruixiao Li <cgruixiao@outlook.com>
Co-authored-by: Li Ruixiao(SII) <80368770+Li-dongyang@users.noreply.github.com>
2026-01-29 09:12:08 +08:00
Prozac614
3fcda00e8c [CI] Fix CI timeouts by upgrading runai_model_streamer (related to #16937) (#17636) 2026-01-28 17:09:45 -08:00
Lianmin Zheng
d4180815a4 Make the functions in logits_processor.py and sampler.py more modular (#17885) 2026-01-28 16:24:23 -08:00
jackey hua
0998de088b [Perf] Tune Llama-4-Scout-17B-16E-Instruct fused moe kernel (#17891) 2026-01-28 14:06:46 -08:00
gingerXue
e9d727cb92 [MUSA][7/N] Enhance CUDA / PyNccl wrapper to support MTLink connectivity detection (#17499)
Signed-off-by: jingzhi.xue <jingzhi.xue@mthreads.com>
Co-authored-by: jingzhi.xue <jingzhi.xue@mthreads.com>
2026-01-28 11:36:30 -08:00
Артем Савкин
b77b0ffd60 [NPU] NZ for non-quantized MOE, Qwen3 MOE double memory consumption fix (#15904)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-29 00:55:08 +08:00
Jinn
1953efb60e [AMD] ROCm: route W4A16 MoE to Triton and fix packed-weight loading (#17863) 2026-01-28 08:20:23 -08:00
triple-mu
1d1e72e516 [diffusion] fix: fix comfyui import typo (#17834) 2026-01-28 23:49:55 +08:00
Xiaoyu Zhang
c08b54a575 [JIT kernel] Update jit_kernel cache and develop doc (#17842) 2026-01-28 15:09:47 +08:00
Mick
2573a262af [diffusion] doc: fix wrong docker run command (#17856) 2026-01-28 14:52:33 +08:00
Ziang Li
a8dda2aa57 [DSv32] Overlap indexer qk projection and activation quant (#17688) 2026-01-28 11:46:49 +08:00
Yisheng Gong
1c4616a034 fix: add bias when enable mm fallback variant (#17690) 2026-01-28 09:50:49 +08:00
陈一涵
647428d8d6 [diffusion] perf: apply mul add fusion for Qwen-Image (#16299)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-28 09:40:13 +08:00
Yashika Gandhi - Google
32ea7bcdd8 [diffusion] endpoint: fix vertex generate (#17611) 2026-01-28 09:38:56 +08:00
Mick
88fcd8535f [diffusion] feat: add an arg for controlling the number of prefetched layers in layerwise-offload (#17693) 2026-01-28 09:34:27 +08:00
Mick
1507dc6cdf [diffusion] fix: fix suppressing error log on non-main ranks (#17712) 2026-01-28 09:29:19 +08:00
Xiaoyu Zhang
331a22427c [Diffusion] glm-image apply flashinfer rope (#17689) 2026-01-28 08:51:37 +08:00
siyu
4d00bd17a3 use shared memory for multimodal feature transport between Tokenizer and Scheduler (#16402)
Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
2026-01-27 11:01:08 -08:00
Minglei Zhu
d90c0837e5 [hybrid-model] clean up and consolidate redundant fields in RadixLinearAttention (#17660) 2026-01-27 10:37:58 -08:00
fsygd
547e2d037e [diffusion] refactor: add arg to control the precision of dit (#17751) 2026-01-27 23:01:23 +08:00
monkeyLoveding
d578b41bad [NPU] Adapt cann 8.5: use sfa and lightning indexer op from cann and CI update (#17615)
Co-authored-by: Kelon <kelonlu@163.com>
2026-01-27 19:03:53 +08:00
MikkoParkkola
c56d19b977 fix(quantization): add sgl_kernel fallback for FP4 quantize on Blackwell GPUs (#17816) 2026-01-27 18:43:17 +08:00
Xuchun Shang
dba264ac73 [PP] fix wrong weight logic for tie_word_embeddings model (#15890)
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
2026-01-27 17:41:17 +08:00
Yuxuan Zhang
7106f6c8e1 [GLM-OCR] Support GLM-OCR Model (#17582)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2026-01-26 22:24:00 -08:00
Taemin Jung
81c0f5c5ad [Model] Add support for EXAONE-4.0 Model (#8205)
Signed-off-by: BoxBy <lute7071@gmail.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2026-01-27 14:08:24 +08:00
laixin
6c9b054ab7 [Bug Fix] Fix reasoning parser when continue_final_message=true (#17065)
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2026-01-27 14:04:44 +08:00
shuwenn
57e432d951 fix: preserve disconnect events in api key middleware (#17253) 2026-01-26 22:48:24 -05:00
shuwenn
fd3b179ffd [HiCache][HA 1/N] Support HiCache storage runtime attach/detach (#15892) 2026-01-26 19:33:19 -08:00
Zhongdongming Dai
1b56a886bb [chore]: improve time tracing of model loading process (#15426)
Co-authored-by: Michael Shin <mmshin@nvidia.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
2026-01-26 19:04:25 -08:00
Yuhao Yang
479ab7a4e7 model: support Kimi-K2.5 (#17789)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-01-27 10:57:00 +08:00
WenhaoZhang
0519b0935f [diffusion] comfyui: support Qwen-Image, Multi-GPU Z-Image, and Enhanced ComfyUI Integration (#17678)
Co-authored-by: niehen6174 <niehen.6174@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-27 10:06:42 +08:00
FlyPanda
2d8c22a15e [bugfix] Internal processing of hf3fs crash # 16614 (#16938) 2026-01-26 18:01:50 -08:00
Mahdi-CV
539924037f fix(processor): support InternS1 text_config in InternVL processor (#17040)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-26 13:02:54 -08:00
ybyang
5ab76ff220 Special logic for healthcheck (#17734)
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
2026-01-26 10:26:40 -08:00
Liangsheng Yin
85d077f44d Introduce global alloc_len_per_decode & clean check decode memory (#15115) 2026-01-26 10:26:20 -08:00
Makcum888e
bba6e38ff8 [NPU] Split pyproject npu from pyproject other (#17641) 2026-01-26 09:45:44 -08:00
Yuan Luo
7bb41989fa [1/N] Optimize All Reduce - Benchmark different AR operations (#13797)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-01-26 22:44:13 +08:00
lawtherWu
b56366f827 [NPU]DeepSeek-V3.2 support npu mlaprolog (#15381)
Co-authored-by: Zhengda Qin <zhengdqin@gmail.com>
Co-authored-by: richhuan <huan_rz@qq.com>
2026-01-26 20:42:37 +08:00
Yi Zhang
5844cb2fd8 refactor mamba radix cache logic in server_args (#17645) 2026-01-26 17:02:49 +08:00
shaharmor98
f6f1b6d000 Bump FI version (#17700)
Signed-off-by: Shahar Mor <smor@nvidia.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
2026-01-26 16:50:06 +08:00
McZyWu
2734b23481 accuracy enhancement for baichuan2-13B for npu (#16868)
Co-authored-by: cy <chenyang08056032@163.com>
2026-01-26 16:14:35 +08:00
Prozac614
12f794e516 [diffusion] fix: fix missing backend argument in pipelines_core initialization (#17343) 2026-01-26 15:47:10 +08:00
Kangyan-Zhou
48f4340b14 Exclude some diffusion package for ARM in docker release (#17745) 2026-01-25 23:32:39 -08:00
Alison Shao
30b3192039 Merge performance/accuracy test suites into regular stage-b suites (#17609) 2026-01-25 22:49:19 -08:00
CSWYF3634076
1a19b3987d [Model] Add Ernie4.5 VL model support (#15679)
Signed-off-by: CSWYF3634076 <wangyafeng@baidu.com>
Signed-off-by: wangyafeng <wangyafeng@baidu.com>
2026-01-25 22:36:29 -08:00