10204 Commits

ouqingliang
eb0898d217 [refactor]: rename KT fallback logs to layerwise prefill 2026-04-09 03:26:44 +00:00
ouqingliang
a53cb6d078 add kt-numa-nodes 2026-03-31 09:30:52 +00:00
ouqingliang
6b5135e1a1 fix speculative worker and moe layer stability 2026-03-31 02:37:57 +00:00
Jianwei Dong
f6adb4f473 Merge pull request #26 from kvcache-ai/fix/sglang-kt-self-referencing-extras
Fix/sglang kt self referencing extras
2026-03-04 16:54:25 +08:00
djw
a45b8d6976 [fix]: update self-referencing extras from sglang to sglang-kt 2026-03-04 08:53:14 +00:00
djw
6b8b5f4649 Use static versioning for sglang-kt, starting at 0.5.2 2026-03-04 06:41:40 +00:00
Jianwei Dong
7241d118b4 Merge pull request #25 from kvcache-ai/rename-to-sglang-kt
Rename PyPI package from sglang to sglang-kt
2026-03-04 14:33:22 +08:00
djw
480c3229d5 Rename PyPI package from sglang to sglang-kt 2026-03-04 06:31:51 +00:00
ouqingliang
b3356b6c46 fix(kt): synchronize INT4 double-buffer slot reuse in fallback prefill 2026-03-03 03:45:30 +00:00
ouqingliang
f1a12b9a93 fix(moe): harden marlin routing and int4 param resolution 2026-03-02 09:35:59 +00:00
xwy-amd8
8d06c338d4 fix(kt): fix Kimi K2.5 RAWINT4 CUDA graph capture crash
Three fixes for Kimi K2.5 RAWINT4 failing to start with CUDA graph:

1. fused_marlin_moe.py: Fix IndentationError from bad merge conflict
   resolution — imports were left outside the `if _is_cuda:` block.

2. fused_marlin_moe.py: Add early return for E=0/M=0. When
   kt-num-gpu-experts=0, GPU expert weights are empty tensors (E=0).
   The marlin MoE kernel crashes on these empty inputs. Return zeros
   so KT CPU experts can contribute the full result.

3. deepseek_v2.py: Skip dual-stream path for KT wrapper. The
   forward_normal_dual_stream path uses alt_stream for shared-expert
   parallelism, which conflicts with the KT wrapper's internal
   _cpu_stream during CUDA graph capture.

Fixes #1866
2026-03-01 23:12:05 +08:00
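Fix 2 in the commit above describes an early-return guard for empty expert inputs. A minimal sketch of that guard pattern, assuming plain nested lists in place of tensors (`moe_forward_or_zeros` and its parameters are illustrative names, not the actual fused_marlin_moe.py API):

```python
def moe_forward_or_zeros(hidden_states, expert_weights, kernel):
    """Sketch of an E=0/M=0 early-return guard in front of a MoE kernel."""
    M = len(hidden_states)   # number of tokens
    E = len(expert_weights)  # number of GPU experts
    if E == 0 or M == 0:
        # With kt-num-gpu-experts=0 the GPU expert weights are empty (E == 0)
        # and the marlin kernel would crash on empty inputs; return zeros so
        # the KT CPU experts can contribute the full result when summed in.
        K = len(hidden_states[0]) if hidden_states else 0
        return [[0.0] * K for _ in range(M)]
    return kernel(hidden_states, expert_weights)
```

The point of returning zeros rather than skipping the call entirely is that downstream code can sum the GPU and CPU expert contributions unconditionally.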
Chen Hongtao
a8b821aee8 fix(kt): robust quant detection in layerwise fallback (#23)
Co-authored-by: chenht2022 <chenht2022@users.noreply.github.com>
2026-02-28 19:45:03 +08:00
Chen Hongtao
e49cfbfb44 Merge pull request #22 from kvcache-ai/fix/mistral-kt-loader-remap
fix(kt): harden expert remap and metadata fallback
2026-02-28 17:50:45 +08:00
chenht2022
529d06ac2b fix(kt): harden expert remap and metadata fallback 2026-02-28 05:54:52 +00:00
xwy-amd8
48b12817a0 Fix MiMo-V2-Flash KTransformers compatibility
- kt_ep_wrapper.py: normalize list-form moe_layer_freq to int
  MiMo-V2-Flash uses a per-layer mask [0,1,1,...] instead of an int freq
- mimo_v2_flash.py: use getattr for pad_token_id (not in MiMo config)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 03:53:32 +00:00
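The normalization in the commit above can be sketched as follows; `normalize_moe_layer_freq` and the collapse rule are assumptions for illustration, not the wrapper's actual code:

```python
def normalize_moe_layer_freq(freq):
    """Hypothetical sketch: collapse a per-layer MoE mask into an int frequency."""
    if isinstance(freq, int):
        return freq
    # MiMo-V2-Flash ships a per-layer mask such as [0, 1, 1, 1]
    # (0 = dense layer, 1 = MoE layer). A stride-style int cannot express
    # an arbitrary mask, so collapse any mask that enables MoE anywhere to
    # frequency 1, leaving per-layer selection to the mask's consumer.
    return 1 if any(freq) else 0
```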
xwy-amd8
e6428614ab Fix: add DeepseekV3ForCausalLM to MLA detection list
_load_deepseek_v32_model() rewrites architectures from
DeepseekV32ForCausalLM to DeepseekV3ForCausalLM for transformers
compatibility, but the MLA detection list did not include
DeepseekV3ForCausalLM, causing use_mla_backend=False and
MHATokenToKVPool to be created instead of NSATokenToKVPool/MLATokenToKVPool.
2026-02-26 15:23:59 +00:00
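The pitfall in the commit above can be reconstructed as a membership check: once the loader rewrites the architecture name, the rewritten name must also appear in the detection list. The function and set below are illustrative, assumed from the commit message rather than taken from the source:

```python
# After the fix, the rewritten name DeepseekV3ForCausalLM is in the list too.
MLA_ARCHITECTURES = {"DeepseekV32ForCausalLM", "DeepseekV3ForCausalLM"}

def select_kv_pool(architecture):
    # _load_deepseek_v32_model() rewrites DeepseekV32ForCausalLM to
    # DeepseekV3ForCausalLM for transformers compatibility, so MLA
    # detection must match the rewritten name or the model silently
    # falls back to the MHA KV pool.
    use_mla_backend = architecture in MLA_ARCHITECTURES
    return "MLATokenToKVPool" if use_mla_backend else "MHATokenToKVPool"
```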
xwy-amd8
2a1bafeb16 Revert "Add DeepseekV32ForCausalLM to NSA auto-selection model_arch list"
This reverts commit 01202ee43d.
2026-02-26 14:10:22 +00:00
xwy-amd8
01202ee43d Add DeepseekV32ForCausalLM to NSA auto-selection model_arch list
DeepseekV32ForCausalLM was missing from the model_arch guard in
_handle_model_specific_adjustments(), so is_deepseek_nsa() was never
reached for V3.2 models. This caused the NSA attention backend to not
be auto-selected, leading to q_rope TypeError with flashinfer or
incorrect behavior with other backends.

Upstream bug introduced in sgl-project/sglang#13687 (commit 618ca2380)
which refactored the flat is_deepseek_nsa() check into a nested block
under model_arch guard but only listed DeepseekV3ForCausalLM.
2026-02-26 14:03:55 +00:00
xwy-amd8
6a63993e9f Revert "Skip KT CPU-GPU coordination during CUDA graph capture"
This reverts commit 2ba1f0dea6.
2026-02-26 12:57:31 +00:00
xwy-amd8
2ba1f0dea6 Skip KT CPU-GPU coordination during CUDA graph capture
During CUDA graph capture (regular or PCG), torch.cuda.synchronize()
and CPU-GPU expert coordination are not allowed. Detect capture mode
via is_in_piecewise_cuda_graph() and torch.cuda.is_current_stream_capturing(),
and delegate directly to the GPU method in those cases.

This enables running Qwen3.5 with --attention-backend triton without
--disable-cuda-graph, improving decode from ~11 tok/s to ~65 tok/s.
2026-02-26 08:34:30 +00:00
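The guard described in the (later reverted) commit above is a simple dispatch on capture mode. In this torch-free sketch the capture check is injected as a callable standing in for `torch.cuda.is_current_stream_capturing()` and `is_in_piecewise_cuda_graph()`; all names here are illustrative:

```python
def forward_experts(x, gpu_forward, cpu_gpu_forward, is_capturing):
    if is_capturing():
        # torch.cuda.synchronize() and CPU-GPU expert coordination are not
        # allowed while a CUDA graph is being captured, so delegate directly
        # to the GPU-only path in that case.
        return gpu_forward(x)
    return cpu_gpu_forward(x)
```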
xwy-amd8
4605b77c7f Fix: revert kt_ep_wrapper.py for kt-kernel 0.5.1 compat, fix rope_scaling property access 2026-02-26 07:57:30 +00:00
xwy-amd8
a2f4513154 Merge upstream/main: bring in PCG (Piecewise CUDA Graph) support for Qwen3.5 GDN 2026-02-26 07:52:14 +00:00
Feng Su
d2b3c7fb14 [Tracing] update script for converting otel tracing data to perfetto format (#19396) 2026-02-26 14:03:22 +08:00
Yilong Zhao
de3d1e7669 [misc] use ORJSONResponse in http-server generate (#19191) 2026-02-25 21:26:25 -08:00
Alison Shao
0fd44ff342 Fix NSA CP positions mismatch in eagle NextN model (#19367) 2026-02-25 20:14:33 -08:00
Xinyu Zhang
119c91cb8b Skip signal handler registration when not on main thread (#18752) 2026-02-25 19:30:05 -08:00
Qi Yuhang
88ad3b894a [sgl-kernel][Feat][B200][2/N] Support MXFP8 Grouped GEMM in Blackwell (#14640)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-02-26 11:23:37 +08:00
Minglei Zhu
b3202fe6d0 [PCG] fix piecewise cuda graph for Qwen3.5 (#19220) 2026-02-26 11:16:52 +08:00
Alison Shao
a0a8f1473c [Benchmark] Fix generated_shared_prefix attribute naming and remove args dependency (#19363)
Co-authored-by: Alison Shao <alisonshao@Mac.attlocal.net>
Co-authored-by: sglang-bot <sglangbot@gmail.com>
2026-02-25 18:45:54 -08:00
sglang-bot
6e82183f5a [Disagg] Route disagg prefill results through process_batch_result (#19364) 2026-02-25 18:38:39 -08:00
Xiaoyu Zhang
914ed34757 update jit_kernel codeowners (#19385) 2026-02-26 10:36:34 +08:00
fzyzcjy
265eb56d44 Support multi-step alignment and pipeline integration in dump comparator (#19378) 2026-02-26 10:23:22 +08:00
Yuan Luo
4e843f1216 [DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache (#19148)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
2026-02-26 10:23:10 +08:00
Michael
f230967e65 [AMD] Fix ROCm Docker builds, update apache-tvm-ffi (#19359) 2026-02-26 10:16:28 +08:00
fzyzcjy
f9a2f0398f Support token aligner planning and execution in dump comparator (#19377) 2026-02-26 10:04:33 +08:00
fzyzcjy
d34d5aca07 Support loading token aligner data in dump comparator (#19376) 2026-02-26 10:03:56 +08:00
fzyzcjy
e8dd14519d Add aligner entrypoint and bundle handler in dump comparator (#19375) 2026-02-26 10:03:22 +08:00
pansicheng
2ad475b4ed use flashinfer.sampling (#18696) 2026-02-26 10:02:38 +08:00
fzyzcjy
2739d7df62 Reorganize modules and pipeline in dump comparator (#19374) 2026-02-26 10:00:13 +08:00
fzyzcjy
508b8e3387 Handle warnings via sink for structured output and add pair in dump comparator (#19373) 2026-02-26 09:59:15 +08:00
fzyzcjy
46321ee70e Support dumping rid for correlation across passes in dump comparator (#19372) 2026-02-26 09:57:57 +08:00
Yuan Luo
7c9e8e2def [Re-land][jit kernel] Support per_token_group_quant_8bit jit kernel (#19140)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu>
2026-02-26 09:53:57 +08:00
Linyu Wu
beabaa8d37 [Kernel Slimming] Migrate marlin moe kernel to JIT (#19181)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-02-26 09:05:13 +08:00
Daniel Cámpora
350190487b Flashinfer MOE FP8 support for Mistral Large 3. (#15422)
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2026-02-25 15:00:37 -08:00
Liangsheng Yin
c60dcc40bb [Logging] Guard log_prefill_stats against idle batches in disagg prefill (#19361) 2026-02-25 13:31:52 -08:00
YAMY
08957c88ea [Logging] Fix prefill side logging in pd disagg (#19350) 2026-02-25 12:42:18 -08:00
Kangyan-Zhou
306c552639 Revert "Fix HybridAttnBackend forward for linear attention" (#19356) 2026-02-25 11:49:50 -08:00
HAI
a0f3361023 Update (#19351) 2026-02-25 11:37:07 -08:00
jacky.cheng
b2c46fc60b [AMD] Support Qwen3-Coder-Next on AMD platform (#18355)
Co-authored-by: yichiche@amd.com <jacky.cheng>
2026-02-25 11:06:22 -08:00
Alison Shao
cc1ca61c81 fix: add --cuda-graph-max-bs to DSV3 FA3 FP8 KV cache test (#19307) 2026-02-25 10:27:31 -08:00