Commit Graph

4223 Commits

Author
SHA1 Message Date
Baizhou Zhang
587deb15a7 [hotfix] Fix pytest not found in CI (#12311) 2025-10-29 11:07:36 +08:00
Cheng Wan
83087247d1 [hotfix] missing w13_weight_fp8 and w2_weight_fp8 in UE8M0 requantization (#12259) 2025-10-28 19:10:38 -07:00
Xiaoyu Zhang
334543ff3b Add continuous_usage_stats support for streaming responses (#12241) 2025-10-29 10:01:23 +08:00
b8zhong
c143f416ce fix: Llama 4 BF16 load on Blackwell (#12308) 2025-10-28 18:59:01 -07:00
fzyzcjy
29195aaa6e Super tiny fix expert distribution dump error (#12271) 2025-10-28 15:20:55 -07:00
bmac3
8d6ab1cb88 fix seqlen bug for trtllm_mla's draft_extend (#12295) 2025-10-28 14:47:47 -07:00
b8zhong
77225d602a Use Flashinfer TRT-LLM as Llama 4 compatible MoE backend (#11928) 2025-10-28 10:39:43 -07:00
Trevor Morris
fdd00295b5 Fix 'BypassedTopKOutput' object has no attribute 'topk_weights' for DeepEP (#12231) 2025-10-28 09:28:25 -07:00
Yineng Zhang
64cf868eba chore: cleanup quant deps (#12268) 2025-10-28 02:03:57 -07:00
Yineng Zhang
ea39952797 Revert "[Feature] PD-Multiplexing Context and Scheduler." (#12267) 2025-10-28 02:00:37 -07:00
Shangming Cai
41a113356a Fix potential eos bug on decode instance when PD is enabled (#12206)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2025-10-28 01:29:02 -07:00
Xuchun Shang
a1f2dc90e4 [Bug fix] [PP] fix wrong dtype for quantified model (#12247)
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
2025-10-28 01:27:24 -07:00
Feng Su
ea96106000 [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 2 (#10804)
Signed-off-by: Feng Su <sufeng@linux.alibaba.com>
2025-10-28 01:25:46 -07:00
Cheng Wan
b1e13e7cea [hotfix] Incorrect CombineOverlapArgs in SBO (#12230) 2025-10-28 01:23:06 -07:00
Chenxi Li
cc7b04a29c Feature/Add GET endpoint to query loaded LoRA adapters (#12229) 2025-10-28 01:22:00 -07:00
fzyzcjy
691c8534cf Support releasing CUDA graph memory when paused (#7873)
Co-authored-by: ryang-max <y1cunhui.yang@gmail.com>
Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com>
2025-10-28 14:40:50 +08:00
Yongfei Xu
d2b8c4123e Opt fused triton moe: add tma for down proj kernel (#10567)
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
2025-10-28 14:26:17 +08:00
Scott Lee
bf8f7a944f Add per-request retraction count (#11177) 2025-10-27 23:22:34 -07:00
hlu1
81a632ace6 [DeepseekV32] Enable flashmla_prefill kernel with fp8 kvcache (#11655)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-10-27 23:11:48 -07:00
ishandhanani
285a8e6986 docker: add CUDA13 support in dockerfile and update GDRCopy/NVSHMEM for blackwell support (#11517)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-10-27 22:00:54 -07:00
Yuan Luo
813bd6f85c [2/2] Use moe_sum_reduce cuda kernel (#10654)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com>
2025-10-28 12:01:57 +08:00
Xinyuan Tong
729f612dc6 Update openai package version to 2.6.1 (#12222) 2025-10-28 11:23:40 +08:00
jianan-gu
899453ac50 Use explicit uint64 dtype for Tensor data_ptr() to avoid overflow (#11994) 2025-10-27 19:05:57 -07:00
Lifu Huang
ce832d7034 Add env var to control custom Triton kernel cache and set CSGMV as default backend. (#12176) 2025-10-27 17:49:32 -07:00
weiliang
88596739a4 Support running FP4 Deepseek on SM120. (#11708) 2025-10-27 17:37:49 -07:00
Yineng Zhang
a6ea3add76 [Auto Sync] Update scheduler.py, spec_info.py, run_suite.py... (20251027) (#12235)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: gongwei-130 <56567052+gongwei-130@users.noreply.github.com>
2025-10-27 17:21:08 -07:00
fzyzcjy
326c84c493 Compiling rope while preserving true on policy (#12161) 2025-10-28 08:02:17 +08:00
gongwei-130
8da608cce0 fix: AttributeError: 'NixlKVManager' object has no attribute 'prefill_tp_size_table' (#12234) 2025-10-27 15:57:33 -07:00
satyamk7054
9fc3e8aac7 Add support for Matryoshka embeddings (#126) (#11142)
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
2025-10-28 02:49:36 +08:00
Chunyuan WU
c11b34d599 rope xpu: fix missing argument 'fused_set_kv_buffer_arg' and replace native with sgl_kernel_xpu impl (#12006) 2025-10-28 01:02:18 +08:00
ykcombat
05ad28f25e [Feature] PD-Multiplexing Context and Scheduler. (#11592) 2025-10-28 00:54:43 +08:00
pansicheng
0cae873fcd check_offload_progress more frequently (#11656) 2025-10-28 00:37:38 +08:00
Haichao Zhu
a8b91f6b2d improve mimax-m2 rmsnorm precision (#12186) 2025-10-28 00:01:42 +08:00
Jimmy
959d1ab84b fix(metrics): double times add_latency for DECODE_BOOTSTRAP (#12209) 2025-10-27 23:59:48 +08:00
Muqi Li
6c1c193308 [Detokenizer Manager] Cleanup state when reqs are finished (#12205) 2025-10-27 23:59:09 +08:00
cctry
3029d30189 Fix crash after flush cache (#12107)
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
2025-10-27 23:52:27 +08:00
Yuan Luo
f389f01714 Optimize triton_mrope with torch compile (#12112)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-10-27 23:49:22 +08:00
Weiwei
caa4819bfc Add support for AutoRound quantized models (#10153) 2025-10-27 18:17:29 +08:00
Yuxuan Zhang
a88b006ecf GLM-4-0414 and GLM-4.1V Code Refactor (#12117) 2025-10-27 16:57:07 +08:00
sglang-bot
55d75e11bd chore: bump SGLang version to 0.5.4.post1 (#12169) 2025-10-27 09:35:20 +08:00
Chang Su
94aad0de99 [misc][grpc] Remove duplicate log (#12168) 2025-10-26 14:06:59 -07:00
赵晨阳
7ebc28f5d6 [WIP] support MiniMax M2 model (#12129)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: xuebi <xuebi@minimaxi.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: Roger Young <42564206+rogeryoungh@users.noreply.github.com>
Co-authored-by: xuebi <xuebi@minimaxi.com>
2025-10-26 13:58:54 -07:00
ash-sigh
0b3b3e9a69 transfer mrope_position_delta to device when first running (#11047) 2025-10-26 13:06:09 -07:00
Elfie Guo
a1d5bc4cce Avoid using flashinfer_allreduce_fusion when dp attention is enabled. (#11632) 2025-10-26 12:31:14 -07:00
Zijian Zhang
a8023891f6 model: support NVILA and NVILA Lite (#10399) 2025-10-26 09:58:09 -07:00
fzyzcjy
0103f374ba Support DeepGEMM for deterministic inference (#12142) 2025-10-26 22:36:17 +08:00
zyksir
96a5a949f6 [Fix] fix allreduce bug in Piecewise Graph (#12106) 2025-10-26 21:15:48 +08:00
Liangsheng Yin
ea385ae85a Fix ITL metrics when using openai endpoint with spec (#12156) 2025-10-26 18:06:25 +08:00
Kai-Hsun Chen
6371f7af27 [quantization] AWQ Marlin doesn't work when dtype is bfloat16 (#11494)
Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2025-10-26 15:49:45 +08:00
Liangsheng Yin
8491c794ad [misc] depdencies & enviroment flag (#12113) 2025-10-26 14:52:35 +08:00