| Author | Commit | Message | Date |
|---|---|---|---|
| Baizhou Zhang | 587deb15a7 | [hotfix] Fix pytest not found in CI (#12311) | 2025-10-29 11:07:36 +08:00 |
| Cheng Wan | 83087247d1 | [hotfix] missing w13_weight_fp8 and w2_weight_fp8 in UE8M0 requantization (#12259) | 2025-10-28 19:10:38 -07:00 |
| Xiaoyu Zhang | 334543ff3b | Add continuous_usage_stats support for streaming responses (#12241) | 2025-10-29 10:01:23 +08:00 |
| b8zhong | c143f416ce | fix: Llama 4 BF16 load on Blackwell (#12308) | 2025-10-28 18:59:01 -07:00 |
| fzyzcjy | 29195aaa6e | Super tiny fix expert distribution dump error (#12271) | 2025-10-28 15:20:55 -07:00 |
| bmac3 | 8d6ab1cb88 | fix seqlen bug for trtllm_mla's draft_extend (#12295) | 2025-10-28 14:47:47 -07:00 |
| b8zhong | 77225d602a | Use Flashinfer TRT-LLM as Llama 4 compatible MoE backend (#11928) | 2025-10-28 10:39:43 -07:00 |
| Trevor Morris | fdd00295b5 | Fix 'BypassedTopKOutput' object has no attribute 'topk_weights' for DeepEP (#12231) | 2025-10-28 09:28:25 -07:00 |
| Yineng Zhang | 64cf868eba | chore: cleanup quant deps (#12268) | 2025-10-28 02:03:57 -07:00 |
| Yineng Zhang | ea39952797 | Revert "[Feature] PD-Multiplexing Context and Scheduler." (#12267) | 2025-10-28 02:00:37 -07:00 |
| Shangming Cai | 41a113356a | Fix potential eos bug on decode instance when PD is enabled (#12206)<br>Signed-off-by: Shangming Cai <csmthu@gmail.com> | 2025-10-28 01:29:02 -07:00 |
| Xuchun Shang | a1f2dc90e4 | [Bug fix] [PP] fix wrong dtype for quantified model (#12247)<br>Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com> | 2025-10-28 01:27:24 -07:00 |
| Feng Su | ea96106000 | [Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 2 (#10804)<br>Signed-off-by: Feng Su <sufeng@linux.alibaba.com> | 2025-10-28 01:25:46 -07:00 |
| Cheng Wan | b1e13e7cea | [hotfix] Incorrect CombineOverlapArgs in SBO (#12230) | 2025-10-28 01:23:06 -07:00 |
| Chenxi Li | cc7b04a29c | Feature/Add GET endpoint to query loaded LoRA adapters (#12229) | 2025-10-28 01:22:00 -07:00 |
| fzyzcjy | 691c8534cf | Support releasing CUDA graph memory when paused (#7873)<br>Co-authored-by: ryang-max <y1cunhui.yang@gmail.com><br>Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com> | 2025-10-28 14:40:50 +08:00 |
| Yongfei Xu | d2b8c4123e | Opt fused triton moe: add tma for down proj kernel (#10567)<br>Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> | 2025-10-28 14:26:17 +08:00 |
| Scott Lee | bf8f7a944f | Add per-request retraction count (#11177) | 2025-10-27 23:22:34 -07:00 |
| hlu1 | 81a632ace6 | [DeepseekV32] Enable flashmla_prefill kernel with fp8 kvcache (#11655)<br>Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com> | 2025-10-27 23:11:48 -07:00 |
| ishandhanani | 285a8e6986 | docker: add CUDA13 support in dockerfile and update GDRCopy/NVSHMEM for blackwell support (#11517)<br>Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> | 2025-10-27 22:00:54 -07:00 |
| Yuan Luo | 813bd6f85c | [2/2] Use moe_sum_reduce cuda kernel (#10654)<br>Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com><br>Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com> | 2025-10-28 12:01:57 +08:00 |
| Xinyuan Tong | 729f612dc6 | Update openai package version to 2.6.1 (#12222) | 2025-10-28 11:23:40 +08:00 |
| jianan-gu | 899453ac50 | Use explicit uint64 dtype for Tensor data_ptr() to avoid overflow (#11994) | 2025-10-27 19:05:57 -07:00 |
| Lifu Huang | ce832d7034 | Add env var to control custom Triton kernel cache and set CSGMV as default backend. (#12176) | 2025-10-27 17:49:32 -07:00 |
| weiliang | 88596739a4 | Support running FP4 Deepseek on SM120. (#11708) | 2025-10-27 17:37:49 -07:00 |
| Yineng Zhang | a6ea3add76 | [Auto Sync] Update scheduler.py, spec_info.py, run_suite.py... (20251027) (#12235)<br>Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com><br>Co-authored-by: gongwei-130 <56567052+gongwei-130@users.noreply.github.com> | 2025-10-27 17:21:08 -07:00 |
| fzyzcjy | 326c84c493 | Compiling rope while preserving true on policy (#12161) | 2025-10-28 08:02:17 +08:00 |
| gongwei-130 | 8da608cce0 | fix: AttributeError: 'NixlKVManager' object has no attribute 'prefill_tp_size_table' (#12234) | 2025-10-27 15:57:33 -07:00 |
| satyamk7054 | 9fc3e8aac7 | Add support for Matryoshka embeddings (#126) (#11142)<br>Co-authored-by: Satyam Kumar <satyamk@linkedin.com> | 2025-10-28 02:49:36 +08:00 |
| Chunyuan WU | c11b34d599 | rope xpu: fix missing argument 'fused_set_kv_buffer_arg' and replace native with sgl_kernel_xpu impl (#12006) | 2025-10-28 01:02:18 +08:00 |
| ykcombat | 05ad28f25e | [Feature] PD-Multiplexing Context and Scheduler. (#11592) | 2025-10-28 00:54:43 +08:00 |
| pansicheng | 0cae873fcd | check_offload_progress more frequently (#11656) | 2025-10-28 00:37:38 +08:00 |
| Haichao Zhu | a8b91f6b2d | improve mimax-m2 rmsnorm precision (#12186) | 2025-10-28 00:01:42 +08:00 |
| Jimmy | 959d1ab84b | fix(metrics): double times add_latency for DECODE_BOOTSTRAP (#12209) | 2025-10-27 23:59:48 +08:00 |
| Muqi Li | 6c1c193308 | [Detokenizer Manager] Cleanup state when reqs are finished (#12205) | 2025-10-27 23:59:09 +08:00 |
| cctry | 3029d30189 | Fix crash after flush cache (#12107)<br>Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> | 2025-10-27 23:52:27 +08:00 |
| Yuan Luo | f389f01714 | Optimize triton_mrope with torch compile (#12112)<br>Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> | 2025-10-27 23:49:22 +08:00 |
| Weiwei | caa4819bfc | Add support for AutoRound quantized models (#10153) | 2025-10-27 18:17:29 +08:00 |
| Yuxuan Zhang | a88b006ecf | GLM-4-0414 and GLM-4.1V Code Refactor (#12117) | 2025-10-27 16:57:07 +08:00 |
| sglang-bot | 55d75e11bd | chore: bump SGLang version to 0.5.4.post1 (#12169) | 2025-10-27 09:35:20 +08:00 |
| Chang Su | 94aad0de99 | [misc][grpc] Remove duplicate log (#12168) | 2025-10-26 14:06:59 -07:00 |
| 赵晨阳 | 7ebc28f5d6 | [WIP] support MiniMax M2 model (#12129)<br>Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com><br>Signed-off-by: xuebi <xuebi@minimaxi.com><br>Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com><br>Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com><br>Co-authored-by: Roger Young <42564206+rogeryoungh@users.noreply.github.com><br>Co-authored-by: xuebi <xuebi@minimaxi.com> | 2025-10-26 13:58:54 -07:00 |
| ash-sigh | 0b3b3e9a69 | transfer mrope_position_delta to device when first running (#11047) | 2025-10-26 13:06:09 -07:00 |
| Elfie Guo | a1d5bc4cce | Avoid using flashinfer_allreduce_fusion when dp attention is enabled. (#11632) | 2025-10-26 12:31:14 -07:00 |
| Zijian Zhang | a8023891f6 | model: support NVILA and NVILA Lite (#10399) | 2025-10-26 09:58:09 -07:00 |
| fzyzcjy | 0103f374ba | Support DeepGEMM for deterministic inference (#12142) | 2025-10-26 22:36:17 +08:00 |
| zyksir | 96a5a949f6 | [Fix] fix allreduce bug in Piecewise Graph (#12106) | 2025-10-26 21:15:48 +08:00 |
| Liangsheng Yin | ea385ae85a | Fix ITL metrics when using openai endpoint with spec (#12156) | 2025-10-26 18:06:25 +08:00 |
| Kai-Hsun Chen | 6371f7af27 | [quantization] AWQ Marlin doesn't work when dtype is bfloat16 (#11494)<br>Signed-off-by: Kai-Hsun Chen <khchen@x.ai><br>Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> | 2025-10-26 15:49:45 +08:00 |
| Liangsheng Yin | 8491c794ad | [misc] depdencies & enviroment flag (#12113) | 2025-10-26 14:52:35 +08:00 |