150 Commits

Author SHA1 Message Date
Cheng Wan
5f7aee726a refactor(moe): de-duplicate triton MoE runner path into shared helpers (#23019)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 17:05:13 -07:00
Hubert Lu
edaa5973d4 [AMD][No-Merge] Simplify fused allreduce + RMSNorm and remove hidden_dim allowlist (#21986)
Co-authored-by: HAI <hixiao@gmail.com>
2026-04-11 23:47:08 -07:00
satyamk7054
059b287e25 Add offline auto-tuning for LoRA CSGMV kernel (#20391)
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
2026-04-10 13:10:43 -07:00
Xinyuan Tong
2813cb6d9a [New Model] Gemma 4 (#21952)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Pengyu Chen <pychen96@gmail.com>
Co-authored-by: kpham-sgl <khoa.pham@radixark.ai>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Andy Luo <andy.luo@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: adarshxs <adarsh.shirawalmath@gmail.com>
2026-04-06 20:24:44 -07:00
Xiaoyu Zhang
f3f7711dac Fix Python 3.11 f-string lint error in deepgemm Blackwell benchmark (#22108)
2026-04-04 21:15:22 +08:00
harrisonlimh
9fa12d605a Add dsv3 router gemm benchmark on blackwell (#17707)
2026-04-04 01:18:01 -07:00
Xiaoyu Zhang
ee9d922f5a Revert "[Kernel] Fuse temperature + softmax in sampling for decode speedup" (#22046)
2026-04-03 21:32:08 +08:00
Mook
7a59e05dd1 [Kernel] Fuse temperature + softmax in sampling for decode speedup (#20501)
2026-04-02 12:46:36 +08:00
Polisetty V R K Jyothendra Varma
f0303fd07e [Intel GPU] Enable DeepSeek R1 inference on XPU (#18461)
Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>
2026-03-29 22:35:59 -07:00
zhangxiaolei
e2b8463c80 [fix] qwen3.5 fuse_moe_triton_tune bug (#20232)
2026-03-27 19:23:24 -04:00
Lianmin Zheng
104b10f70a refactor: consolidate is_in_ci (jit_kernel, sgl-kernel benchmarks, tests) (#21009)
2026-03-20 05:55:36 -07:00
cs-cat
22e378af86 Fix result writer in tuning_block_wise_kernel.py, and add FP8 kernel config for L40 (#20368)
Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
2026-03-20 09:28:54 +08:00
Xiaoyu Zhang
25e38216b6 [kernel slimming] Clean many useless sgl-kernel deprecated kernels (#20277)
2026-03-14 16:45:54 +08:00
Chongchong Tian
70d4aabe42 Add CLI args to conveniently support tuning more models (#12922)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-12 23:10:55 -07:00
Mook
abc672e717 [Benchmark] use flashinfer bench_gpu_time instead of triton do_bench (#20305)
2026-03-12 04:04:30 +00:00
Yuan Luo
751c454099 Add DeepSeek3.2 and GlmMoeDsa into moe tune (#18876)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-03-10 17:12:58 +08:00
RoyWang
a1ef8e2cc0 [AMD] optimize Kimi K2.5 fused_moe_triton performance by tuning (#19228)
2026-02-26 11:50:13 -08:00
Hubert Lu
17b0affbdf [AMD] Support --enable-aiter-allreduce-fusion on AMD GPUs (#13747)
Co-authored-by: yctseng0211 <yctseng@amd.com>
2026-02-24 23:11:55 -08:00
satyamk7054
355127c2e9 Fix benchmark_sglang_fused_moe_triton.py (#18940)
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
2026-02-17 17:25:37 -05:00
Zheng Li
27c447653d model: support Qwen3.5 (#18489)
Co-authored-by: 瑀澈 <yuche.lz@alibaba-inc.com>
2026-02-10 00:27:59 +08:00
b8zhong
22498e10c0 [Fix] Triton TP MoE Dpsk V3/Qwen3 Coder with SwapAB (#17965)
2026-01-31 15:56:26 +08:00
Yuan Luo
7bb41989fa [1/N] Optimize All Reduce - Benchmark different AR operations (#13797)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2026-01-26 22:44:13 +08:00
Julian Huang
db2425a00b [Fix]: correctly fetch ds32 config in tuning_fused_moe_triton (#17409)
Co-authored-by: 墨楼 <huangzhilin.hzl@antgroup.com>
2026-01-20 20:08:28 +08:00
Mohammad Miadh Angkad
b0701f02b3 Fix benchmark import for should_use_tensor_core (#17232)
2026-01-16 17:48:36 -05:00
Yongfei Xu
82a1b645ba [DeepSeek V3.1/V3.2] Optimize fused moe configs for H20 & H20-3E based on swapab (#17133)
2026-01-17 00:10:52 +08:00
roikoren755
b021332339 [NemotronH] Add latent MoE support (#16227)
Signed-off-by: Roi Koren <roik@nvidia.com>
2026-01-02 22:08:58 +08:00
Xiaoyu Zhang
03b835e7d1 Refactor tuning block wise kernel and opt Qwen/Qwen3-VL-32B-Instruct-FP8 (#14141)
2025-12-08 09:24:58 +08:00
Daniel Cámpora
8428078436 Add Mistral Large 3 support. (#14213)
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-12-04 20:00:05 +08:00
Uranus
982db4ebac Feat: GLM-4.6 supports shared experts fusion (#13873)
Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com>
Co-authored-by: Kevin-XiongC <kevin_xiong1997@outlook.com>
Co-authored-by: Mingyi Jin <jinmingyi1998@sina.cn>
2025-12-01 11:33:18 +08:00
Xiaoyu Zhang
ecefc7904f [sgl-kernel Code Clean] Remove useless lightning_attention kernel (#13819)
2025-11-24 18:26:25 +08:00
roikoren755
1b48e1b974 Feat/nemotron nano v3 support (#12690)
2025-11-21 13:53:05 -08:00
Kaixi Hou
c3c4da71fb [NVIDIA] Add fp8 gemm benchmark on blackwell (#13528)
2025-11-19 19:35:00 -08:00
Junlin Zhou
0779c3d148 docs: update fused MoE config path (#13211)
2025-11-13 11:14:01 -08:00
Shu Wang
6664083522 Replace [silu_and_mul_]scaled_fp4_group_quant by Flashinfer equivalent (#12376)
2025-11-13 00:26:00 -08:00
Hubert Lu
e4b2937017 [AMD] Add AITER Custom All-Reduce (#13102)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: HaiShaw <hixiao@gmail.com>
2025-11-12 21:53:44 -08:00
Xiaoyu Zhang
f18ec927f3 fix tuning_fused_moe_triton_sep tool per_channel_quant bug (#13027)
2025-11-11 10:33:54 +08:00
Xiaoyu Zhang
fc84b0730c [Refactor] Refactor fused_moe_triton tuning tools: extract shared utils, add EP/MLLM support, reduce overhead (#12440)
Co-authored-by: xu-yfei <xu-yfei@users.noreply.github.com>
Co-authored-by: Yongfei Xu <xuyongfei.xyf@antgroup.com>
2025-11-06 20:54:42 +08:00
Yuan Luo
819fc59123 Add prefix for torch symm mem (#12506)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-11-02 11:23:05 -08:00
Xinyuan Tong
82cfcd3bb8 [Refactor] tuning_fused_moe for MLLM and small refactor (#11224)
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
2025-10-31 08:54:14 +08:00
Chen1022
1ed1abfd45 feat: add EP support in tuning (#12012)
2025-10-30 07:58:50 -07:00
Xiaoyu Zhang
04e5b6faa7 Revert "Triton fused_moe_kernel support ep moe tuning" (#12377)
2025-10-30 07:12:06 -07:00
Xiaoyu Zhang
52694b60da Triton fused_moe_kernel support ep moe tuning (#12343)
2025-10-29 23:16:09 +08:00
Liana Koleva
1357397a34 feat: preview filename from tuning_fused_moe_triton.py (#12276)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-10-29 16:12:25 +08:00
Yongfei Xu
d2b8c4123e Opt fused triton moe: add tma for down proj kernel (#10567)
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
2025-10-28 14:26:17 +08:00
Zhengyi Lai
81fd2b0ee0 fix(deepep): resolve benchmark failure on 4×IB-card setup by aligning tuning config with DeepEP commit bdd119f8 (#11965)
2025-10-22 21:20:54 -07:00
Liangsheng Yin
9d61205dac [lint] improve ruff check (#11922)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-10-22 11:32:50 +08:00
Cheng Wan
5b214b50b6 [Refactor] move deep_gemm_wrapper out of quantization (#11784)
2025-10-17 18:57:54 -07:00
Cheng Wan
3c06b673af [8/N] MoE Refactor: deprecate EPMoE (#11211)
2025-10-07 21:51:41 -07:00
Yuan Luo
4f42c8cd3e [sgl-kernel] Support float64 moe_sum_reduce cuda kernel (#11068)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-10-07 14:31:11 +00:00
Yuan Luo
590f2da052 [Feat] Support Torch Symm Mem AllReduce (#10571)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-10-05 13:55:19 -07:00