458 Commits

Author SHA1 Message Date
shuwenn
b65799cf83 [SPEC][1/N] feat: add adaptive speculative_num_steps for EAGLE topk=1 (#21599)
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2026-04-20 14:25:04 -07:00
Cheng Wan
5f7aee726a refactor(moe): de-duplicate triton MoE runner path into shared helpers (#23019)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 17:05:13 -07:00
Hubert Lu
edaa5973d4 [AMD][No-Merge] Simplify fused allreduce + RMSNorm and remove hidden_dim allowlist (#21986)
Co-authored-by: HAI <hixiao@gmail.com>
2026-04-11 23:47:08 -07:00
satyamk7054
059b287e25 Add offline auto-tuning for LoRA CSGMV kernel (#20391)
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
2026-04-10 13:10:43 -07:00
Aditya Sharma
f6e85676b5 model: support qwen3-asr (#22073)
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
2026-04-07 13:27:05 +08:00
Xinyuan Tong
2813cb6d9a [New Model] Gemma 4 (#21952)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Pengyu Chen <pychen96@gmail.com>
Co-authored-by: kpham-sgl <khoa.pham@radixark.ai>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Andy Luo <andy.luo@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: adarshxs <adarsh.shirawalmath@gmail.com>
2026-04-06 20:24:44 -07:00
Xiaoyu Zhang
f3f7711dac Fix Python 3.11 f-string lint error in deepgemm Blackwell benchmark (#22108)
2026-04-04 21:15:22 +08:00
harrisonlimh
9fa12d605a Add dsv3 router gemm benchmark on blackwell (#17707)
2026-04-04 01:18:01 -07:00
Xiaoyu Zhang
ee9d922f5a Revert "[Kernel] Fuse temperature + softmax in sampling for decode speedup" (#22046)
2026-04-03 21:32:08 +08:00
Mook
7a59e05dd1 [Kernel] Fuse temperature + softmax in sampling for decode speedup (#20501)
2026-04-02 12:46:36 +08:00
Yuan Luo
03a87068ea [KDA] Fuse scaled_dot_kkt + solve_tril + recompute_w_u for KDA (#21604)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-31 20:57:27 -07:00
Polisetty V R K Jyothendra Varma
f0303fd07e [Intel GPU] Enable DeepSeek R1 inference on XPU (#18461)
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
2026-03-29 22:35:59 -07:00
shuwenn
c34593f951 [HiCache] fix: graceful shutdown of pending async tasks in bench_mix.py (#20276)
2026-03-29 00:46:32 -07:00
zhangxiaolei
e2b8463c80 [fix] qwen3.5 fuse_moe_triton_tune bug (#20232)
2026-03-27 19:23:24 -04:00
Yuan Luo
f273ba1ccc [KDA] Support CuTeDSL KDA decode kernel (#21203)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-25 09:47:09 +08:00
Jiaxin(Jackson) Deng
c4db64c16b Add Lychee Doc Links Check to Local and CI (#19742)
Co-authored-by: Zijie Xia <zijie_xia@icloud.com>
Co-authored-by: Zijie Xia <zijiexia@users.noreply.github.com>
Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
2026-03-24 13:48:26 -07:00
hzh0425
0986bed8e2 [HiCache][HybridModel]: Support mamba state offloading & HybridCacheController (#20457)
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
2026-03-23 20:02:50 -07:00
Lianmin Zheng
104b10f70a refactor: consolidate is_in_ci (jit_kernel, sgl-kernel benchmarks, tests) (#21009)
2026-03-20 05:55:36 -07:00
cs-cat
22e378af86 Fix result writer in tuning_block_wise_kernel.py, and add FP8 kernel config for L40 (#20368)
Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
2026-03-20 09:28:54 +08:00
Xinyuan Tong
6b8a6545b2 Add Mistral Small 4 (Pixtral) support (#20708)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Alex Nails <alexnails@radixark.ai>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: dbari <dbari@users.noreply.github.com>
2026-03-18 14:15:32 -07:00
Yuan Luo
9c87e137ee [GDN] Support GDN packed decode (#20627)
2026-03-18 13:20:07 +08:00
Xiaoyu Zhang
25e38216b6 [kernel slimming] Clean many useless sgl-kernel deprecated kernels (#20277)
2026-03-14 16:45:54 +08:00
Chongchong Tian
70d4aabe42 Add CLI args to conveniently support tuning more models (#12922)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-12 23:10:55 -07:00
Yuan Luo
e29305c120 [GDN] Add benchmark for sglang gdn prefill (#20428)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Kaixi Hou <kaixih@nvidia.com>
2026-03-12 22:25:02 +08:00
Mook
abc672e717 [Benchmark] use flashinfer bench_gpu_time instead of triton do_bench (#20305)
2026-03-12 04:04:30 +00:00
Yuan Luo
751c454099 Add DeepSeek3.2 and GlmMoeDsa into moe tune (#18876)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-10 17:12:58 +08:00
Mohammad Miadh Angkad
1b76eb9361 [Doc] Update version references and add automation (#18409)
2026-03-04 09:51:46 -08:00
Kangyan-Zhou
dc92f88a21 Enhance bench_multiturn.py with OpenAI API support and richer metrics (#19724)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:48:04 -08:00
RoyWang
a1ef8e2cc0 [AMD] optimize Kimi K2.5 fused_moe_triton performance by tuning (#19228)
2026-02-26 11:50:13 -08:00
Alison Shao
a0a8f1473c [Benchmark] Fix generated_shared_prefix attribute naming and remove args dependency (#19363)
Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net>
Co-authored-by: sglang-bot <sglangbot@gmail.com>
2026-02-25 18:45:54 -08:00
Julian Huang
a55f658835 [Misc] Normalize --host parameter to use plain hostname without scheme (#19309)
Co-authored-by: 墨楼 <huangzhilin.hzl@antgroup.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
2026-02-25 00:37:24 -08:00
Hubert Lu
17b0affbdf [AMD] Support --enable-aiter-allreduce-fusion on AMD GPUs (#13747)
Co-authored-by: yctseng0211 <yctseng@amd.com>
2026-02-24 23:11:55 -08:00
Ratish P
ae6f6e1495 [Refactor] Benchmark: Add typed DatasetArgs/Loader registry and CPU dataset unit tests (#19147)
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
2026-02-24 12:22:01 -08:00
Alec Leng
38f25e802d Fix/deepseek readme link (#19258)
2026-02-24 10:46:56 -08:00
Xinyuan Tong
581bf53e03 Whisper model support & /v1/audio/transcriptions endpoint & benchmark (#16983)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-23 17:28:37 -08:00
Liangsheng Yin
1f2da824dd [Benchmark] Remove re-exports from bench_serving.py (#19130)
2026-02-21 14:30:30 -08:00
satyamk7054
355127c2e9 Fix benchmark_sglang_fused_moe_triton.py (#18940)
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
2026-02-17 17:25:37 -05:00
SoluMilken
07a24f1a38 update pre-commit config (#18860)
2026-02-16 00:18:31 +08:00
Ke Bao
a9d59776cc Enhence gsm8k test (#18791)
2026-02-13 18:08:57 +08:00
Liangsheng Yin
cd90346a2b Add cache hit rate UT (#18566)
2026-02-10 21:27:41 -08:00
Zheng Li
27c447653d model: support Qwen3.5 (#18489)
Co-authored-by: 瑀澈 <yuche.lz@alibaba-inc.com>
2026-02-10 00:27:59 +08:00
Xinyuan Tong
0b4d4f2838 Fix MMLU benchmark to auto-download data and resolve path issue (#18486)
2026-02-09 10:40:40 -05:00
Yuan Luo
4ea4f2a20c [VLM] Optimize get_rope_index for GLM4v (#17420)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-02-01 18:59:15 +08:00
b8zhong
22498e10c0 [Fix] Triton TP MoE Dpsk V3/Qwen3 Coder with SwapAB (#17965)
2026-01-31 15:56:26 +08:00
Yuan Luo
7bb41989fa [1/N] Optimize All Reduce - Benchmark different AR operations (#13797)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-01-26 22:44:13 +08:00
Jacob Gordon
a296c99ff4 refactor(benchmark): prevents variable shadowing (#17607)
2026-01-22 17:00:11 -05:00
Julian Huang
db2425a00b [Fix]: correctly fetch ds32 config in tuning_fused_moe_triton (#17409)
Co-authored-by: 墨楼 <huangzhilin.hzl@antgroup.com>
2026-01-20 20:08:28 +08:00
Mohammad Miadh Angkad
b0701f02b3 Fix benchmark import for should_use_tensor_core (#17232)
2026-01-16 17:48:36 -05:00
Yongfei Xu
82a1b645ba [DeepSeek V3.1/V3.2] Optimize fused moe configs for H20 & H20-3E based on swapab (#17133)
2026-01-17 00:10:52 +08:00
b8zhong
d44f09ad98 [Benchmark] Add GSM8K Platinum Eval (#14565)
2026-01-16 11:06:14 +08:00