sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-07-01 04:08:10 +00:00

Author	SHA1	Message	Date
Cheng Wan	5f7aee726a	refactor(moe): de-duplicate triton MoE runner path into shared helpers (#23019 ) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 17:05:13 -07:00
Hubert Lu	edaa5973d4	[AMD][No-Merge] Simplify fused allreduce + RMSNorm and remove hidden_dim allowlist (#21986 ) Co-authored-by: HAI <hixiao@gmail.com>	2026-04-11 23:47:08 -07:00
satyamk7054	059b287e25	Add offline auto-tuning for LoRA CSGMV kernel (#20391 ) Co-authored-by: Satyam Kumar <satyamk@linkedin.com>	2026-04-10 13:10:43 -07:00
Xinyuan Tong	2813cb6d9a	[New Model] Gemma 4 (#21952 ) Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Pengyu Chen <pychen96@gmail.com> Co-authored-by: kpham-sgl <khoa.pham@radixark.ai> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Andy Luo <andy.luo@amd.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: adarshxs <adarsh.shirawalmath@gmail.com>	2026-04-06 20:24:44 -07:00
Xiaoyu Zhang	f3f7711dac	Fix Python 3.11 f-string lint error in deepgemm Blackwell benchmark (#22108 )	2026-04-04 21:15:22 +08:00
harrisonlimh	9fa12d605a	Add dsv3 router gemm benchmark on blackwell (#17707 )	2026-04-04 01:18:01 -07:00
Xiaoyu Zhang	ee9d922f5a	Revert "[Kernel] Fuse temperature + softmax in sampling for decode speedup" (#22046 )	2026-04-03 21:32:08 +08:00
Mook	7a59e05dd1	[Kernel] Fuse temperature + softmax in sampling for decode speedup (#20501 )	2026-04-02 12:46:36 +08:00
Polisetty V R K Jyothendra Varma	f0303fd07e	[Intel GPU] Enable DeepSeek R1 inference on XPU (#18461 ) Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com>	2026-03-29 22:35:59 -07:00
zhangxiaolei	e2b8463c80	[fix] qwen3.5 fuse_moe_triton_tune bug (#20232 )	2026-03-27 19:23:24 -04:00
Lianmin Zheng	104b10f70a	refactor: consolidate is_in_ci (jit_kernel, sgl-kernel benchmarks, tests) (#21009 )	2026-03-20 05:55:36 -07:00
cs-cat	22e378af86	Fix result writer in tuning_block_wise_kernel.py, and add FP8 kernel config for L40 (#20368 ) Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>	2026-03-20 09:28:54 +08:00
Xiaoyu Zhang	25e38216b6	[kernel slimming] Clean many useless sgl-kernel deprecated kernels (#20277 )	2026-03-14 16:45:54 +08:00
Chongchong Tian	70d4aabe42	Add CLI args to conveniently support tuning more models (#12922 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-12 23:10:55 -07:00
Mook	abc672e717	[Benchmark] use flashinfer bench_gpu_time instead of triton do_bench (#20305 )	2026-03-12 04:04:30 +00:00
Yuan Luo	751c454099	Add DeepSeek3.2 and GlmMoeDsa into moe tune (#18876 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2026-03-10 17:12:58 +08:00
RoyWang	a1ef8e2cc0	[AMD] optimize Kimi K2.5 fused_moe_triton performance by tuning (#19228 )	2026-02-26 11:50:13 -08:00
Hubert Lu	17b0affbdf	[AMD] Support --enable-aiter-allreduce-fusion on AMD GPUs (#13747 ) Co-authored-by: yctseng0211 <yctseng@amd.com>	2026-02-24 23:11:55 -08:00
satyamk7054	355127c2e9	Fix benchmark_sglang_fused_moe_triton.py (#18940 ) Co-authored-by: Satyam Kumar <satyamk@linkedin.com>	2026-02-17 17:25:37 -05:00
Zheng Li	27c447653d	model: support Qwen3.5 (#18489 ) Co-authored-by: 瑀澈 <yuche.lz@alibaba-inc.com>	2026-02-10 00:27:59 +08:00
b8zhong	22498e10c0	[Fix] Triton TP MoE Dpsk V3/Qwen3 Coder with SwapAB (#17965 )	2026-01-31 15:56:26 +08:00
Yuan Luo	7bb41989fa	[1/N] Optimize All Reduce - Benchmark different AR operations (#13797 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2026-01-26 22:44:13 +08:00
Julian Huang	db2425a00b	[Fix]: correctly fetch ds32 config in tuning_fused_moe_triton (#17409 ) Co-authored-by: 墨楼 <huangzhilin.hzl@antgroup.com>	2026-01-20 20:08:28 +08:00
Mohammad Miadh Angkad	b0701f02b3	Fix benchmark import for should_use_tensor_core (#17232 )	2026-01-16 17:48:36 -05:00
Yongfei Xu	82a1b645ba	[DeepSeek V3.1/V3.2] Optimize fused moe configs for H20 & H20-3E based on swapab (#17133 )	2026-01-17 00:10:52 +08:00
roikoren755	b021332339	[NemotronH] Add latent MoE support (#16227 ) Signed-off-by: Roi Koren <roik@nvidia.com>	2026-01-02 22:08:58 +08:00
Xiaoyu Zhang	03b835e7d1	Refactor tuning block wise kernel and opt Qwen/Qwen3-VL-32B-Instruct-FP8 (#14141 )	2025-12-08 09:24:58 +08:00
Daniel Cámpora	8428078436	Add Mistral Large 3 support. (#14213 ) Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>	2025-12-04 20:00:05 +08:00
Uranus	982db4ebac	Feat: GLM-4.6 supports shared experts fusion (#13873 ) Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com> Co-authored-by: Kevin-XiongC <kevin_xiong1997@outlook.com> Co-authored-by: Mingyi Jin <jinmingyi1998@sina.cn>	2025-12-01 11:33:18 +08:00
Xiaoyu Zhang	ecefc7904f	[sgl-kernel Code Clean] Remove useless lightning_attention kernel (#13819 )	2025-11-24 18:26:25 +08:00
roikoren755	1b48e1b974	Feat/nemotron nano v3 support (#12690 )	2025-11-21 13:53:05 -08:00
Kaixi Hou	c3c4da71fb	[NVIDIA] Add fp8 gemm benchmark on blackwell (#13528 )	2025-11-19 19:35:00 -08:00
Junlin Zhou	0779c3d148	docs: update fused MoE config path (#13211 )	2025-11-13 11:14:01 -08:00
Shu Wang	6664083522	Replace [silu_and_mul_]scaled_fp4_group_quant by Flashinfer equivalent (#12376 )	2025-11-13 00:26:00 -08:00
Hubert Lu	e4b2937017	[AMD] Add AITER Custom All-Reduce (#13102 ) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: HaiShaw <hixiao@gmail.com>	2025-11-12 21:53:44 -08:00
Xiaoyu Zhang	f18ec927f3	fix tuning_fused_moe_triton_sep tool per_channel_quant bug (#13027 )	2025-11-11 10:33:54 +08:00
Xiaoyu Zhang	fc84b0730c	[Refactor] Refactor fused_moe_triton tuning tools: extract shared utils, add EP/MLLM support, reduce overhead (#12440 ) Co-authored-by: xu-yfei <xu-yfei@users.noreply.github.com> Co-authored-by: Yongfei Xu <xuyongfei.xyf@antgroup.com>	2025-11-06 20:54:42 +08:00
Yuan Luo	819fc59123	Add prefix for torch symm mem (#12506 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-11-02 11:23:05 -08:00
Xinyuan Tong	82cfcd3bb8	[Refactor] tuning_fused_moe for MLLM and small refactor (#11224 ) Co-authored-by: Cursor Agent <cursoragent@cursor.com>	2025-10-31 08:54:14 +08:00
Chen1022	1ed1abfd45	feat: add EP support in tuning (#12012 )	2025-10-30 07:58:50 -07:00
Xiaoyu Zhang	04e5b6faa7	Revert "Triton fused_moe_kernel support ep moe tuning" (#12377 )	2025-10-30 07:12:06 -07:00
Xiaoyu Zhang	52694b60da	Triton fused_moe_kernel support ep moe tuning (#12343 )	2025-10-29 23:16:09 +08:00
Liana Koleva	1357397a34	feat: preview filename from tuning_fused_moe_triton.py (#12276 ) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>	2025-10-29 16:12:25 +08:00
Yongfei Xu	d2b8c4123e	Opt fused triton moe: add tma for down proj kernel (#10567 ) Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>	2025-10-28 14:26:17 +08:00
Zhengyi Lai	81fd2b0ee0	fix(deepep): resolve benchmark failure on 4×IB-card setup by aligning tuning config with DeepEP commit bdd119f8 (#11965 )	2025-10-22 21:20:54 -07:00
Liangsheng Yin	9d61205dac	[lint] improve ruff check (#11922 ) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>	2025-10-22 11:32:50 +08:00
Cheng Wan	5b214b50b6	[Refactor] move `deep_gemm_wrapper` out of `quantization` (#11784 )	2025-10-17 18:57:54 -07:00
Cheng Wan	3c06b673af	[8/N] MoE Refactor: deprecate `EPMoE` (#11211 )	2025-10-07 21:51:41 -07:00
Yuan Luo	4f42c8cd3e	[sgl-kernel] Support float64 moe_sum_reduce cuda kernel (#11068 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-10-07 14:31:11 +00:00
Yuan Luo	590f2da052	[Feat] Support Torch Symm Mem AllReduce (#10571 ) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>	2025-10-05 13:55:19 -07:00

150 Commits