Kangyan-Zhou
|
b6055e59cd
|
[HiCache] Reduce per-request backup log noise (#20813)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-03-17 22:47:14 -07:00 |
|
Viacheslav
|
30a35ecd90
|
Add gigachat3.1 parser (#19886)
Signed-off-by: Viacheslav Barinov <vvadbarinov@sberbank.ru>
Signed-off-by: Viacheslav Bv <viacheslav.teh@gmail.com>
Co-authored-by: Viacheslav Barinov <vvadbarinov@sberbank.ru>
|
2026-03-17 22:45:01 -07:00 |
|
Evgueni Petrov
|
2e860233ca
|
rocm: fix oom when loading fp8 weights close to size of available vram (#19941)
|
2026-03-17 22:44:19 -07:00 |
|
shiyu7
|
0acc1d3c9a
|
fix: change qwen 3.5 linear attention a_log to fp32 (#19961)
Co-authored-by: sunqi.7 <sunqi.7@bytedance.com>
|
2026-03-17 22:42:06 -07:00 |
|
Brayden Zhong
|
88c40ec16d
|
Use Flashinfer for target_verify in GDN model for SM120 (#20604)
|
2026-03-17 22:40:56 -07:00 |
|
Brayden Zhong
|
97d5386a21
|
Use TRTLLM allreduce fusion for Qwen 3.5 (#19889)
|
2026-03-17 22:40:22 -07:00 |
|
Yuan Luo
|
9c87e137ee
|
[GDN] Support GDN packed decode (#20627)
|
2026-03-18 13:20:07 +08:00 |
|
Kaixi Hou
|
4cc19862ef
|
[NVIDIA] Integrate FlashInfer decode kernel (Blackwell) for Qwen3.5 (#19150)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-03-18 13:11:18 +08:00 |
|
hzh0425
|
c43d495dd5
|
[RadixTree][9/N Refactor]: Support unified init_load_back params (#20590)
|
2026-03-18 11:19:52 +08:00 |
|
Mick
|
f15b3338c9
|
Revert "[Bugfix] Fix GLM-4.6V vision regression in glm4v_moe and glm_ocr" (#20740)
|
2026-03-18 10:09:50 +08:00 |
|
lviy
|
944355c66f
|
[Bugfix] Fix model output corruption caused by EPLB rebalance (Eager and CUDA Graph modes) (#18213)
Co-authored-by: FortPercent <49947620+FortPercent@users.noreply.github.com>
|
2026-03-17 18:30:24 -07:00 |
|
Liangsheng Yin
|
4d3976b6c5
|
[HiCache] Check in-flight async ops in is_fully_idle() before attach/detach (#20746)
|
2026-03-17 17:28:26 -07:00 |
|
Qiaolin Yu
|
c5d2528bff
|
Revert "[AMD][MORI] Fix MTP crash with FP4/FP8 dispatch and add NEXTN dispatch env vars." (#20797)
|
2026-03-17 17:28:09 -07:00 |
|
Shangming Cai
|
2acb20f53b
|
[Disagg] Non-blocking try_ensure_parallel_info in pending queue, consolidate rank mapping into PrefillServerInfo (#20785)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
|
2026-03-17 17:26:18 -07:00 |
|
Rain Jiang
|
cb1e63aba4
|
bump fa4 to official released fa4 pkg (#20303)
|
2026-03-17 17:22:56 -07:00 |
|
Jincong Chen
|
c77d7c629e
|
[Bugfix] Fix MTP prefill cuda graph logging (#20279)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-03-17 16:36:52 -07:00 |
|
Kaixi
|
744b1c9e6f
|
Added fallback to individual copy_ (#20683)
|
2026-03-17 14:44:38 -07:00 |
|
Kangyan-Zhou
|
3d8fc9a0ca
|
Revert "[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api" (#20792)
|
2026-03-17 11:59:02 -07:00 |
|
Артем Савкин
|
09f5097fe4
|
[NPU] [Bugfix] [diffusion] Fix NZ performance bug for diffusion models (#20684)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-03-17 21:23:09 +03:00 |
|
Shu Wang
|
d35fea1b2b
|
[Nvidia] Add trtllm mnnvl allreduce with unified flashinfer allreduce fusion api (#12787)
|
2026-03-17 10:02:45 -07:00 |
|
Yongfei Xu
|
17031120b8
|
[DeepSeek v3.2][Bugfix] get_index_k_scale_buffer support cp (#18280)
|
2026-03-17 09:54:54 -07:00 |
|
Serge Panev
|
466ff20e51
|
[Model] Fix NemotronH OOM on unified-mem systems: stream weights + safetensors cleanup (#20580)
Signed-off-by: Serge Panev <spanev@nvidia.com>
|
2026-03-17 09:47:58 -07:00 |
|
Yuhao Yang
|
24a27d5320
|
vlm: support piecewise cuda graph for Kimi-K2.5 (#20747)
|
2026-03-18 00:32:07 +08:00 |
|
heziiop
|
b5f3eaecbc
|
[NPU] Support dequant_swiglu_quant & moe_init_routing_v2 & npu_moe_token_unpermute for W8A8 MoE decode (#19913)
|
2026-03-17 21:39:29 +08:00 |
|
Mick
|
5717834f1f
|
[diffusion] refactor: cleanup parallel_state.py (#20760)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-03-17 21:21:42 +08:00 |
|
Shangming Cai
|
17c81a3e07
|
Revert "[PD] Make pending reqs resolving more robust" (#20779)
|
2026-03-17 20:31:12 +08:00 |
|
YAMY
|
cfead25bbf
|
[Qwen3.5] mamba slice fix (Prefill TP != Decode TP & decode TP size>1) (#20655)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
|
2026-03-17 19:30:58 +08:00 |
|
AMD-yanfeiwang
|
966ae87d02
|
[AMD] avoid correction_bias_dtype dtype convert (#20692)
|
2026-03-17 02:55:05 -07:00 |
|
Liangsheng Yin
|
5270a06488
|
[Disagg] Fix health check false-positive in disagg is_fully_idle (#20756)
|
2026-03-17 17:18:54 +08:00 |
|
Duyi-Wang
|
385a35bd11
|
[AMD][MORI] Fix MTP crash with FP4/FP8 dispatch and add NEXTN dispatch env vars. (#20647)
|
2026-03-17 01:13:42 -07:00 |
|
Junhao Liu
|
ee106757df
|
[diffusion] fix: fix Diffusers backend ignores model-specific sampling parameter (#20080)
Co-authored-by: Mick <mickjagger19@icloud.com>
|
2026-03-17 16:10:46 +08:00 |
|
akhilg-nv
|
9a697ceabb
|
[Fix #20389] Illegal memory access in triton attention for large token counts (#20390)
|
2026-03-17 00:42:11 -07:00 |
|
Ratish P
|
e3277b3be2
|
[diffusion]: remove stale offload-manager in LTX2 AV denoising (#20624)
|
2026-03-17 15:14:00 +08:00 |
|
DefTruth
|
025691cd9e
|
[diffusion] chore: bump up cache-dit & support quant for diffusers backend (#20361)
|
2026-03-17 12:51:31 +08:00 |
|
Rocky Song
|
079a1fd35e
|
[Bugfix] Fix write-through events not processed when scheduler is idle (#20560)
|
2026-03-16 21:49:59 -07:00 |
|
Shangming Cai
|
5d5c31c6e4
|
[PP] Add CP pyobj broadcasting when enable dynamic CPP (#20738)
|
2026-03-17 12:20:11 +08:00 |
|
MMuzzammil1
|
855ec7017d
|
Add check to provide hicache-storage-backend when enabling kv caching on Decode Side in PD Disaggregation (#20732)
Signed-off-by: Mohd Muzzammil <me.muzzammil@samsung.com>
|
2026-03-17 11:25:14 +08:00 |
|
Hubert Lu
|
943f34f642
|
Add NCCL/RCCL pre-warming to reduce P99 TTFT cold-start latency (#20477)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-03-16 20:23:14 -07:00 |
|
Jay Shaik
|
e4d06b3db2
|
Fix /generate JSON serialization for non-finite top_logprobs (#20714)
|
2026-03-16 20:07:12 -07:00 |
|
shuwenn
|
515b3a323d
|
feat: support human-readable suffixes (25.6k, 1M, 1Mi) for token CLI (#20577)
|
2026-03-16 20:05:33 -07:00 |
|
psaab
|
9f56b471aa
|
[Network] Use NetworkAddress for dist_init_method and loopback fallbacks (#20657)
Co-authored-by: hnyls2002 <lsyincs@gmail.com>
|
2026-03-16 19:59:49 -07:00 |
|
Jason Yao
|
4dbec2dd2b
|
[typo] Fix typos in comments and log messages in common.py (#20723)
|
2026-03-16 19:26:59 -07:00 |
|
Qiaolin Yu
|
7d87a6a071
|
Fix spec v1 token_ids_logprobs (#20718)
|
2026-03-16 19:23:28 -07:00 |
|
Mick
|
474a851ae3
|
[diffusion] fix: fix sampling params incorrectly override in cli (#20689)
|
2026-03-17 08:48:10 +08:00 |
|
Mick
|
1eea744855
|
[diffusion] CI: enable UT (#20690)
|
2026-03-17 07:44:04 +08:00 |
|
roikoren755
|
5ef5806160
|
[Nemotron] Small reasoning parser fix (#20284)
|
2026-03-16 13:29:40 -07:00 |
|
Bruce Wu
|
70a6fb53af
|
Enable embedding lookup/lora_a logic for chunked backend (#17692)
Co-authored-by: Bruce Wu <mogicianwu@fb.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
|
2026-03-16 11:37:58 -07:00 |
|
Douglas Yang
|
061ec582bf
|
fix: adding teacache.params back to sampling params as intended (#20665)
|
2026-03-16 11:27:06 -07:00 |
|
ybyang
|
289cbcf482
|
fix: support PP2+CP8+TP8 (PP with context parallelism) (#19548)
|
2026-03-16 16:51:47 +00:00 |
|
Xiaoyu Zhang
|
6489f77733
|
[Diffusion] Fix compile graph broken by flashinfer rope (#20699)
|
2026-03-16 23:14:27 +08:00 |
|