Commit Graph

  • 5d6bef9f61 fix(v4): auto-bump swa_full_tokens_ratio to fit chunks_in_flight (#60) (ref: main) Benjamin 2026-06-25 16:19:52 +08:00
  • 1344f647a6 fix: kt_ep_wrapper silently fails to import (#59) usrlocalben 2026-06-23 23:11:25 -05:00
  • 37eecb4e99 Enable DeepSeek V4 Flash inference on Ampere GPUs (#58) chenghanke 2026-06-22 13:53:55 +08:00
  • 8b636f9008 Feat/minimax m3 (#56) Benjamin 2026-06-21 17:09:49 +08:00
  • 31124d0ae7 fix: kt_ep_wrapper silently fails to import after a2f451315 (#57) usrlocalben 2026-06-21 01:03:44 -05:00
  • a2a8a7e9e0 support glm5.2 Jianwei Dong 2026-06-16 16:33:56 +08:00
  • ee149528a3 fix(dsa): wire skip_topk-gated indexer for GlmMoeDsa to unblock GLM-5.2 yyj 2026-06-13 20:19:01 +08:00
  • dd9ba529f6 [Bugfix] Restore overridden HF config fields and support index_skip_topk_offset for DSA topk sharing (#27114) Yuxuan Zhang 2026-06-06 13:26:04 +08:00
  • 51032b7127 feat: support end-to-end KT LoRA serving for Qwen3.5 MoE (#53) Jiaheng Dai 2026-06-05 15:37:28 +08:00
  • 6467cf5b24 feat(kse): port unified-sparse-kv KVCache Sparsity Engine from magicYang1573/sglang copilot/move-unified-sparse-kv copilot-swe-agent[bot] 2026-05-26 08:11:06 +00:00
  • d7567978e5 debug: add more NaN detection points + combined GPU/CPU report (ref: fix/nan-4090) yyj 2026-05-15 13:57:31 +08:00
  • 8546927613 debug: add NaN detection in MoE internals yyj 2026-05-15 13:40:43 +08:00
  • 83207724ce debug: add NaN detection points for 4090 debugging yyj 2026-05-15 13:26:27 +08:00
  • ebaff7729b fix: regressions (scheduler hang, cuda graph TypeError, MXFP4 cache, rsf double-apply) (#50) Benjamin F 2026-05-14 14:00:27 +08:00
  • 9fa1de6bce fix: remove undefined _GraphBucket reference in cuda graph replay (ref: pr50) yyj 2026-05-14 05:22:32 +00:00
  • 91277ffd6e Merge pull request #1 from yyj6666667/fix/scheduler-req-pool-regression Benjamin F 2026-05-13 17:36:44 +08:00
  • 2763727f30 fix(cuda_graph): use out-of-band _replay_forward_batch for non-DSV4 backends (ref: fix/scheduler-req-pool-regression) yyj 2026-05-13 17:18:29 +08:00
  • de9d7bf83a fix(scheduler): revert PR #38 req_pool changes that break TP-only mode yyj 2026-05-13 16:38:51 +08:00
  • bedcff3786 fix(v4-flash): remove broken MXFP4 weight cache + fix rsf double-apply yyj 2026-05-11 21:09:52 +08:00
  • 00126648e2 [PD] Add EFA disaggregation transport (ref: support_multi_protocol) Teng Ma 2026-05-10 23:51:29 +08:00
  • c49ee54e73 Merge remote-tracking branch 'origin/main' into pr-21859-support-multi-protocol Teng Ma 2026-05-10 22:35:49 +08:00
  • 335dbd60b4 Support Intern-S2-Preview (#24875) RunningLeon 2026-05-10 22:17:30 +08:00
  • 59faf986b2 [PD] Unify dsv4 dispatch with swa (#24888) Ke Bao 2026-05-10 22:01:13 +08:00
  • 2f06867128 Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (#24775) Yuhao Yang 2026-05-10 19:03:37 +08:00
  • bd0aa22309 Fix PD bootstrap failure handling (#24772) Yuhao Yang 2026-05-10 19:02:47 +08:00
  • 8cc16c9974 [Spec] Cleanup idle stub and shape-check patterns (#24881) Liangsheng Yin 2026-05-10 02:39:53 -07:00
  • c7f674e427 [Bug] Add dsv4 state_type branch to mooncake disaggregation (#24878) Cheng Wan 2026-05-10 01:13:46 -07:00
  • d08744238a [Spec V1] Split draft-extend phase from EagleDraftInput into new EagleDraftExtendInput (#24859) Liangsheng Yin 2026-05-10 01:07:45 -07:00
  • d3fd91ed97 [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (#24696) Yuan Luo 2026-05-10 15:24:12 +08:00
  • a87fb399de [spec decoding] support kimi-k2.5-eagle3-mla (#24826) Qiaolin Yu 2026-05-09 23:57:39 -07:00
  • b4d347e86e [SPEC V2] fix: skip stale state updates in spec-v2 overlap (#23456) shuwenn 2026-05-10 14:56:24 +08:00
  • cfd3fd00d0 [RL] Call torch.cuda.empty_cache() for in-place pause mode to avoid OOM (#24854) Byron Hsu 2026-05-09 23:36:52 -07:00
  • 44efc23a9a [diffusion] CI: add cache-dit CI tests (#19213) Chi McIsaac 2026-05-10 01:38:41 -04:00
  • 1e6c6d1f07 [Utils] Make request dump robust to unpicklable server_args and large meta_info (#24767) Byron Hsu 2026-05-09 21:41:41 -07:00
  • 9578ba1b57 [Utils] Refactor device cache emptying (#24861) Stefan He 2026-05-09 21:28:00 -07:00
  • 47483001b6 [PrefillDelayer] support NCCL all-gather for cross-DP info sync (#24768) Byron Hsu 2026-05-09 21:20:03 -07:00
  • 7edb4c3cea [NUMA+Ray] Fix NUMA NVML handle resolution under shuffled CUDA_VISIBLE_DEVICES (#24766) Byron Hsu 2026-05-09 21:18:39 -07:00
  • f9c315e85d docs: clarify how /tag-and-rerun-ci kicks off CI on the current commit (#24774) Byron Hsu 2026-05-09 20:28:41 -07:00
  • c95454b341 speculative: drop dead params/returns/no-ops (#24865) Liangsheng Yin 2026-05-09 15:53:31 -07:00
  • b735ca178c Update CODEOWNERS for /sgl-kernel/csrc/musa (#24746) R0CKSTAR 2026-05-10 05:45:13 +08:00
  • 12f42f2e7e Support Gemma3/4 + Eagle3 (#23976) Charles Chen 2026-05-09 13:34:56 -07:00
  • 8087e07d52 [UnifiedRadixTree]: Align cache_empty_result with RadixTree (#24779) luchangli 2026-05-09 23:52:22 +08:00
  • ef5e9f8aba [DSV4] Cherry pick missing commits from deepseek_v4 branch and enhance tests (#24793) Baizhou Zhang 2026-05-09 04:15:37 -07:00
  • 4b23f6bdc5 Fix performance regression on Deepseek V3 on moe-runner-backend=triton on SM90 (#24562) Brayden Zhong 2026-05-09 06:49:12 -04:00
  • 05d1ab51e8 Enable PDL for various kernels in DSV32/GLM5 (#23965) Brayden Zhong 2026-05-09 06:42:56 -04:00
  • d5564c2a96 fix(fa3): translate page table to SWA loc in EAGLE3 topk>1 spec metadata (#24617) shuwenn 2026-05-09 18:22:45 +08:00
  • a309f1f8f4 fix(cuda_graph): zero out_cache_loc_swa on pad and use int32 (hybrid-SWA accuracy fix) (#24743) JoyFuture 2026-05-09 18:22:12 +08:00
  • ba625d5290 slash command rerun UX: emoji semantics + result writeback (#24802) Liangsheng Yin 2026-05-09 03:19:24 -07:00
  • f4b7e73699 Enable trtllm-gen BF16 MoE for MTP (#24260) Brayden Zhong 2026-05-09 06:14:17 -04:00
  • f1a9a455e0 Revert "[NPU] fix profiler on npu" (#24815) sglang-npu-bot 2026-05-09 17:53:02 +08:00
  • e2527df8b6 [NPU] fix profiler on npu (#24685) zhaozx-cn 2026-05-09 17:48:24 +08:00
  • fd636410a2 Restrict fa_skip_kv_cache to non-MLA backends (#24097) Jia Guo 2026-05-09 02:25:02 -07:00
  • 8f33bee31b Reland Cute-DSL FP4 dense GEMM (#23590) Brayden Zhong 2026-05-09 05:20:58 -04:00
  • 43ed1ec77a refactor(dsv4): isolate DeepSeek V4 Flash behind plugin registries (#47) Benjamin F 2026-05-09 16:33:18 +08:00
  • d49fc092cb [Bug Fix] GLM-5.1: drop constexpr on page_indice_batch_offset, skip offloader post_init on draft worker, support N=32 in copy_to_gpu_no_ce (#23550) Yuxuan Zhang 2026-05-09 09:43:45 +02:00
  • 9d12f9e6fa [HiCache] ci: lower est_time for test_hicache_spec_file_storage (#24713) shuwenn 2026-05-09 15:33:18 +08:00
  • 78da0d3106 [Spec] Move accept_tokens off EagleDraftInput; pass via method arg (#24735) Liangsheng Yin 2026-05-08 23:24:18 -07:00
  • 1610aa77ab Reduce gemma4 moe deterministic test runtime (#24754) Khoa Pham 2026-05-08 20:46:56 -07:00
  • 8e534e8f15 [diffusion] fix: fix diffusers executor crash when component residency manager is absent (#24573) Chi McIsaac 2026-05-08 23:45:06 -04:00
  • 44a527f6f4 fix patch_torch test queue race (#24739) Liangsheng Yin 2026-05-08 20:25:59 -07:00
  • 590b13b513 [diffusion] fix: fix NCCL deadlock in ulysses sp when sequence length has remainder (#24694) storyicon 2026-05-09 11:05:37 +08:00
  • 50ed01674e fix is_arch_support_pdl function usage (#24600) Polisetty V R K Jyothendra Varma 2026-05-09 07:09:34 +05:30
  • 1613bae412 [Spec] Disambiguate verified_id into bonus_token(s) / accept_tokens (#24724) Liangsheng Yin 2026-05-08 18:24:33 -07:00
  • a61a14f416 [KDA] Optimize prefill kernels with diagonal and recompute fuse (#24271) Yuan Luo 2026-05-09 08:52:51 +08:00
  • 9ee830346f Disable Custom AR V2 when in multi-node (#24729) Brayden Zhong 2026-05-08 20:50:05 -04:00
  • d1c5937428 env: add SGLANG_RADIX_FORCE_MISS to force radix prefix-cache miss (#24726) Cheng Wan 2026-05-08 17:46:38 -07:00
  • 560829a171 feat(scheduler): add adaptive queue-based prefill delayer trigger (#23189) YAMY 2026-05-08 16:54:30 -07:00
  • 6971a03fe6 fix(fa3): skip scheduler_metadata precompute under DP attention (#24632) YAMY 2026-05-08 16:19:20 -07:00
  • 62c2e091f6 [PD] MORI-IO: Add state transfer, inline transfer model, and high-concurrency fixes (#22665) Niko Ma 2026-05-09 07:07:22 +08:00
  • 190b15c8fe [AMD] Register 8 CPU-bound unit tests for AMD 1-GPU PR CI (#24569) Michael 2026-05-09 07:01:58 +08:00
  • 5fbec0e445 ci: prune per-commit CUDA tests — move 25 files + 13 testcases to test/manual/ (#24721) Alison Shao 2026-05-08 15:53:23 -07:00
  • aefd8e257f Re-land #23109: rebase-required mode + fix for grep-no-match abort (#24180) Alison Shao 2026-05-08 15:28:57 -07:00
  • fa8985486e [test/fix]: isolate VLM MMMU eval output dirs to fix nightly-4-gpu cross-test pollution (#24623) Jimmy Shong 2026-05-08 17:01:53 -05:00
  • 5dc4c7bef1 Add speculative decoding naming convention rule (#24094) Liangsheng Yin 2026-05-08 14:52:31 -07:00
  • 096ad02b06 [Model] Laguna-XS.2 Model Support (#24204) Jimmy Shong 2026-05-08 16:43:13 -05:00
  • 7b707c9222 disable the combination of --enable-two-batch-overlap and --enforce-s… (#24720) Cheng Wan 2026-05-08 14:27:35 -07:00
  • 09912fd89d Remove unnecessary bf16 assert in rotate_activation (#24686) Yuhao Yang 2026-05-09 05:00:52 +08:00
  • f30d1d0b0a logits: remove blocking H2D copy (#24627) Yilong Zhao 2026-05-08 13:22:13 -07:00
  • 672f778512 [NemotronH] Fix expert scale weight loading (#24434) Ethan Feng 2026-05-09 03:37:06 +08:00
  • 2cf1a4ab38 feat: Add KV events for Mamba radix cache (#23678) zhongdaor-nv 2026-05-08 11:53:36 -07:00
  • ca7a8cc61d [Bugfix] Fix a bug causing NVFP4 to be tested on all gpus like SM90 devices. (#24604) Xu Zou 2026-05-09 02:51:30 +08:00
  • e40e339c72 Filter non-int token ids in benchmark and observe decode-side bootstrap/alloc metrics (#24684) Lianmin Zheng 2026-05-08 11:45:37 -07:00
  • 73b8eda103 [diffusion] fix: fix FA3 varlen out argument handling (#24688) Mick 2026-05-08 19:01:49 +08:00
  • 17888fa92a [diffusion] doc: update ltx2 multi-gpu deployment guide (#24682) Mick 2026-05-08 18:38:05 +08:00
  • 7f8e7a9130 fix(aiter): drop FP8 KV upcast; use native FP8 path in paged_attentio… (#24129) fanxingran 2026-05-08 17:47:48 +08:00
  • f21d4868dc [AMD] Replace naive triton RMSNorm with aiter RMSNorm for diffusion model (#24360) jacky.cheng 2026-05-08 17:44:13 +08:00
  • e1150f66db [AMD][diffusion] Temporal-unfolded batched Conv2D for ROCm VAE decode (#22971) YC Yen-Ching Tseng 2026-05-08 17:32:14 +08:00
  • d32e283947 [NPU] [DOC] refresh npu supported model list (#24676) amote-i 2026-05-08 17:08:15 +08:00
  • 80d0226b68 Turn on JIT custom AR implementation by default (#24363) Brayden Zhong 2026-05-08 05:05:31 -04:00
  • 73792629d4 [AMD] Intro SGLANG_DIFFUSION_AITER_FP8_ATTN (#24677) HAI 2026-05-08 01:31:00 -07:00
  • 76a1f169b3 [AMD] Add AMD FP8 MLA attention test for Wan2.2-T2V-A14B (#23955) jacky.cheng 2026-05-08 16:03:51 +08:00
  • b22d3cd606 [AMD] Support fp8 MLA for diffusion model (#20319) jacky.cheng 2026-05-08 15:56:24 +08:00
  • 19afe73e03 [AMD] Cherry-pick aiter commit for mhc_pre fix (#24665) Thomas Wang 2026-05-08 14:49:39 +08:00
  • 55d8223c2b [sgl-kernel/cpu] support w8a8 int8 model for arm cpu (#16045) Yibo Cai 2026-05-08 14:47:06 +08:00
  • 47e9ec11ad [NPU] [DOC] fix ascend_npu_support_new_models TOC (#24658) amote-i 2026-05-08 14:07:00 +08:00
  • e1bc001872 fix(mimo_v2): auto-disable multimodal when vision/audio configs are absent (#24652) JoyFuture 2026-05-08 13:40:08 +08:00
  • 7deed98e1b [fix] /pause_generation and /continue_generation wrong for --tokenizer-worker-num > 1 (#24462) maocheng23 2026-05-07 21:32:21 -07:00
  • 2afb450501 [diffusion] optimize: optimize frame returns path (#24616) Mick 2026-05-08 12:10:09 +08:00
  • cdf5771f91 [MUSA][17/N] ci: Add MUSA diffusion, sgl-kernel tests, and CI workflow support (#20672) johnnycxm 2026-05-08 11:45:21 +08:00
  • 15e6572f21 [MUSA][18/N] Add MUSA-optimized kernel implementations for hot ops (#23255) Joey 2026-05-08 11:38:33 +08:00