sglang

kvcache-ai/sglang

mirror of https://github.com/kvcache-ai/sglang.git synced 2026-06-30 03:37:51 +00:00

Commit Graph

5d6bef9f61 fix(v4): auto-bump swa_full_tokens_ratio to fit chunks_in_flight (#60) main Benjamin 2026-06-25 16:19:52 +08:00
1344f647a6 fix: kt_ep_wrapper silently fails to import (#59) usrlocalben 2026-06-23 23:11:25 -05:00
37eecb4e99 Enable DeepSeek V4 Flash inference on Ampere GPUs (#58) chenghanke 2026-06-22 13:53:55 +08:00
8b636f9008 Feat/minimax m3 (#56) Benjamin 2026-06-21 17:09:49 +08:00
31124d0ae7 fix: kt_ep_wrapper silently fails to import after a2f451315 (#57) usrlocalben 2026-06-21 01:03:44 -05:00
a2a8a7e9e0 support glm5.2 Jianwei Dong 2026-06-16 16:33:56 +08:00
ee149528a3 fix(dsa): wire skip_topk-gated indexer for GlmMoeDsa to unblock GLM-5.2 yyj 2026-06-13 20:19:01 +08:00
dd9ba529f6 [Bugfix] Restore overridden HF config fields and support index_skip_topk_offset for DSA topk sharing (#27114) Yuxuan Zhang 2026-06-06 13:26:04 +08:00
51032b7127 feat: support end-to-end KT LoRA serving for Qwen3.5 MoE (#53) Jiaheng Dai 2026-06-05 15:37:28 +08:00
6467cf5b24 feat(kse): port unified-sparse-kv KVCache Sparsity Engine from magicYang1573/sglang copilot/move-unified-sparse-kv copilot-swe-agent[bot] 2026-05-26 08:11:06 +00:00
d7567978e5 debug: add more NaN detection points + combined GPU/CPU report fix/nan-4090 yyj 2026-05-15 13:57:31 +08:00
8546927613 debug: add NaN detection in MoE internals yyj 2026-05-15 13:40:43 +08:00
83207724ce debug: add NaN detection points for 4090 debugging yyj 2026-05-15 13:26:27 +08:00
ebaff7729b fix: regressions (scheduler hang, cuda graph TypeError, MXFP4 cache, rsf double-apply) (#50) Benjamin F 2026-05-14 14:00:27 +08:00
9fa1de6bce fix: remove undefined _GraphBucket reference in cuda graph replay pr50 yyj 2026-05-14 05:22:32 +00:00
91277ffd6e Merge pull request #1 from yyj6666667/fix/scheduler-req-pool-regression Benjamin F 2026-05-13 17:36:44 +08:00
2763727f30 fix(cuda_graph): use out-of-band _replay_forward_batch for non-DSV4 backends fix/scheduler-req-pool-regression yyj 2026-05-13 17:18:29 +08:00
de9d7bf83a fix(scheduler): revert PR #38 req_pool changes that break TP-only mode yyj 2026-05-13 16:38:51 +08:00
bedcff3786 fix(v4-flash): remove broken MXFP4 weight cache + fix rsf double-apply yyj 2026-05-11 21:09:52 +08:00
00126648e2 [PD] Add EFA disaggregation transport support_multi_protocol Teng Ma 2026-05-10 23:51:29 +08:00
c49ee54e73 Merge remote-tracking branch 'origin/main' into pr-21859-support-multi-protocol Teng Ma 2026-05-10 22:35:49 +08:00
335dbd60b4 Support Intern-S2-Preview (#24875) RunningLeon 2026-05-10 22:17:30 +08:00
59faf986b2 [PD] Unify dsv4 dispatch with swa (#24888) Ke Bao 2026-05-10 22:01:13 +08:00
2f06867128 Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (#24775) Yuhao Yang 2026-05-10 19:03:37 +08:00
bd0aa22309 Fix PD bootstrap failure handling (#24772) Yuhao Yang 2026-05-10 19:02:47 +08:00
8cc16c9974 [Spec] Cleanup idle stub and shape-check patterns (#24881) Liangsheng Yin 2026-05-10 02:39:53 -07:00
c7f674e427 [Bug] Add dsv4 state_type branch to mooncake disaggregation (#24878) Cheng Wan 2026-05-10 01:13:46 -07:00
d08744238a [Spec V1] Split draft-extend phase from EagleDraftInput into new EagleDraftExtendInput (#24859) Liangsheng Yin 2026-05-10 01:07:45 -07:00
d3fd91ed97 [Gemma4] Optimize Gemm4 with fused Q/K/V RMSNorm + per-expert FP8 ckpt loader (#24696) Yuan Luo 2026-05-10 15:24:12 +08:00
a87fb399de [spec decoding] support kimi-k2.5-eagle3-mla (#24826) Qiaolin Yu 2026-05-09 23:57:39 -07:00
b4d347e86e [SPEC V2] fix: skip stale state updates in spec-v2 overlap (#23456) shuwenn 2026-05-10 14:56:24 +08:00
cfd3fd00d0 [RL] Call torch.cuda.empty_cache() for in-place pause mode to avoid OOM (#24854) Byron Hsu 2026-05-09 23:36:52 -07:00
44efc23a9a [diffusion] CI: add cache-dit CI tests (#19213) Chi McIsaac 2026-05-10 01:38:41 -04:00
1e6c6d1f07 [Utils] Make request dump robust to unpicklable server_args and large meta_info (#24767) Byron Hsu 2026-05-09 21:41:41 -07:00
9578ba1b57 [Utils] Refactor device cache emptying (#24861) Stefan He 2026-05-09 21:28:00 -07:00
47483001b6 [PrefillDelayer] support NCCL all-gather for cross-DP info sync (#24768) Byron Hsu 2026-05-09 21:20:03 -07:00
7edb4c3cea [NUMA+Ray] Fix NUMA NVML handle resolution under shuffled CUDA_VISIBLE_DEVICES (#24766) Byron Hsu 2026-05-09 21:18:39 -07:00
f9c315e85d docs: clarify how /tag-and-rerun-ci kicks off CI on the current commit (#24774) Byron Hsu 2026-05-09 20:28:41 -07:00
c95454b341 speculative: drop dead params/returns/no-ops (#24865) Liangsheng Yin 2026-05-09 15:53:31 -07:00
b735ca178c Update CODEOWNERS for /sgl-kernel/csrc/musa (#24746) R0CKSTAR 2026-05-10 05:45:13 +08:00
12f42f2e7e Support Gemma3/4 + Eagle3 (#23976) Charles Chen 2026-05-09 13:34:56 -07:00
8087e07d52 [UnifiedRadixTree]: Align cache_empty_result with RadixTree (#24779) luchangli 2026-05-09 23:52:22 +08:00
ef5e9f8aba [DSV4] Cherry pick missing commits from deepseek_v4 branch and enhance tests (#24793) Baizhou Zhang 2026-05-09 04:15:37 -07:00
4b23f6bdc5 Fix performance regression on Deepseek V3 on moe-runner-backend=triton on SM90 (#24562) Brayden Zhong 2026-05-09 06:49:12 -04:00
05d1ab51e8 Enable PDL for various kernels in DSV32/GLM5 (#23965) Brayden Zhong 2026-05-09 06:42:56 -04:00
d5564c2a96 fix(fa3): translate page table to SWA loc in EAGLE3 topk>1 spec metadata (#24617) shuwenn 2026-05-09 18:22:45 +08:00
a309f1f8f4 fix(cuda_graph): zero out_cache_loc_swa on pad and use int32 (hybrid-SWA accuracy fix) (#24743) JoyFuture 2026-05-09 18:22:12 +08:00
ba625d5290 slash command rerun UX: emoji semantics + result writeback (#24802) Liangsheng Yin 2026-05-09 03:19:24 -07:00
f4b7e73699 Enable trtllm-gen BF16 MoE for MTP (#24260) Brayden Zhong 2026-05-09 06:14:17 -04:00
f1a9a455e0 Revert "[NPU] fix profiler on npu" (#24815) sglang-npu-bot 2026-05-09 17:53:02 +08:00
e2527df8b6 [NPU] fix profiler on npu (#24685) zhaozx-cn 2026-05-09 17:48:24 +08:00
fd636410a2 Restrict fa_skip_kv_cache to non-MLA backends (#24097) Jia Guo 2026-05-09 02:25:02 -07:00
8f33bee31b Reland Cute-DSL FP4 dense GEMM (#23590) Brayden Zhong 2026-05-09 05:20:58 -04:00
43ed1ec77a refactor(dsv4): isolate DeepSeek V4 Flash behind plugin registries (#47) Benjamin F 2026-05-09 16:33:18 +08:00
d49fc092cb [Bug Fix] GLM-5.1: drop constexpr on page_indice_batch_offset, skip offloader post_init on draft worker, support N=32 in copy_to_gpu_no_ce (#23550) Yuxuan Zhang 2026-05-09 09:43:45 +02:00
9d12f9e6fa [HiCache] ci: lower est_time for test_hicache_spec_file_storage (#24713) shuwenn 2026-05-09 15:33:18 +08:00
78da0d3106 [Spec] Move accept_tokens off EagleDraftInput; pass via method arg (#24735) Liangsheng Yin 2026-05-08 23:24:18 -07:00
1610aa77ab Reduce gemma4 moe deterministic test runtime (#24754) Khoa Pham 2026-05-08 20:46:56 -07:00
8e534e8f15 [diffusion] fix: fix diffusers executor crash when component residency manager is absent (#24573) Chi McIsaac 2026-05-08 23:45:06 -04:00
44a527f6f4 fix patch_torch test queue race (#24739) Liangsheng Yin 2026-05-08 20:25:59 -07:00
590b13b513 [diffusion] fix: fix NCCL deadlock in ulysses sp when sequence length has remainder (#24694) storyicon 2026-05-09 11:05:37 +08:00
50ed01674e fix is_arch_support_pdl function usage (#24600) Polisetty V R K Jyothendra Varma 2026-05-09 07:09:34 +05:30
1613bae412 [Spec] Disambiguate verified_id into bonus_token(s) / accept_tokens (#24724) Liangsheng Yin 2026-05-08 18:24:33 -07:00
a61a14f416 [KDA] Optimize prefill kernels with diagonal and recompute fuse (#24271) Yuan Luo 2026-05-09 08:52:51 +08:00
9ee830346f Disable Custom AR V2 when in multi-node (#24729) Brayden Zhong 2026-05-08 20:50:05 -04:00
d1c5937428 env: add SGLANG_RADIX_FORCE_MISS to force radix prefix-cache miss (#24726) Cheng Wan 2026-05-08 17:46:38 -07:00
560829a171 feat(scheduler): add adaptive queue-based prefill delayer trigger (#23189) YAMY 2026-05-08 16:54:30 -07:00
6971a03fe6 fix(fa3): skip scheduler_metadata precompute under DP attention (#24632) YAMY 2026-05-08 16:19:20 -07:00
62c2e091f6 [PD] MORI-IO: Add state transfer, inline transfer model, and high-concurrency fixes (#22665) Niko Ma 2026-05-09 07:07:22 +08:00
190b15c8fe [AMD] Register 8 CPU-bound unit tests for AMD 1-GPU PR CI (#24569) Michael 2026-05-09 07:01:58 +08:00
5fbec0e445 ci: prune per-commit CUDA tests — move 25 files + 13 testcases to test/manual/ (#24721) Alison Shao 2026-05-08 15:53:23 -07:00
aefd8e257f Re-land #23109: rebase-required mode + fix for grep-no-match abort (#24180) Alison Shao 2026-05-08 15:28:57 -07:00
fa8985486e [test/fix]: isolate VLM MMMU eval output dirs to fix nightly-4-gpu cross-test pollution (#24623) Jimmy Shong 2026-05-08 17:01:53 -05:00
5dc4c7bef1 Add speculative decoding naming convention rule (#24094) Liangsheng Yin 2026-05-08 14:52:31 -07:00
096ad02b06 [Model] Laguna-XS.2 Model Support (#24204) Jimmy Shong 2026-05-08 16:43:13 -05:00
7b707c9222 disable the combination of --enable-two-batch-overlap and --enforce-s… (#24720) Cheng Wan 2026-05-08 14:27:35 -07:00
09912fd89d Remove unnecessary bf16 assert in rotate_activation (#24686) Yuhao Yang 2026-05-09 05:00:52 +08:00
f30d1d0b0a logits: remove blocking H2D copy (#24627) Yilong Zhao 2026-05-08 13:22:13 -07:00
672f778512 [NemotronH] Fix expert scale weight loading (#24434) Ethan Feng 2026-05-09 03:37:06 +08:00
2cf1a4ab38 feat: Add KV events for Mamba radix cache (#23678) zhongdaor-nv 2026-05-08 11:53:36 -07:00
ca7a8cc61d [Bugfix] Fix a bug causing NVFP4 to be tested on all gpus like SM90 devices. (#24604) Xu Zou 2026-05-09 02:51:30 +08:00
e40e339c72 Filter non-int token ids in benchmark and observe decode-side bootstrap/alloc metrics (#24684) Lianmin Zheng 2026-05-08 11:45:37 -07:00
73b8eda103 [diffusion] fix: fix FA3 varlen out argument handling (#24688) Mick 2026-05-08 19:01:49 +08:00
17888fa92a [diffusion] doc: update ltx2 multi-gpu deployment guide (#24682) Mick 2026-05-08 18:38:05 +08:00
7f8e7a9130 fix(aiter): drop FP8 KV upcast; use native FP8 path in paged_attentio… (#24129) fanxingran 2026-05-08 17:47:48 +08:00
f21d4868dc [AMD] Replace naive triton RMSNorm with aiter RMSNorm for diffusion model (#24360) jacky.cheng 2026-05-08 17:44:13 +08:00
e1150f66db [AMD][diffusion] Temporal-unfolded batched Conv2D for ROCm VAE decode (#22971) YC Yen-Ching Tseng 2026-05-08 17:32:14 +08:00
d32e283947 [NPU] [DOC] refresh npu supported model list (#24676) amote-i 2026-05-08 17:08:15 +08:00
80d0226b68 Turn on JIT custom AR implementation by default (#24363) Brayden Zhong 2026-05-08 05:05:31 -04:00
73792629d4 [AMD] Intro SGLANG_DIFFUSION_AITER_FP8_ATTN (#24677) HAI 2026-05-08 01:31:00 -07:00
76a1f169b3 [AMD] Add AMD FP8 MLA attention test for Wan2.2-T2V-A14B (#23955) jacky.cheng 2026-05-08 16:03:51 +08:00
b22d3cd606 [AMD] Support fp8 MLA for diffusion model (#20319) jacky.cheng 2026-05-08 15:56:24 +08:00
19afe73e03 [AMD] Cherry-pick aiter commit for mhc_pre fix (#24665) Thomas Wang 2026-05-08 14:49:39 +08:00
55d8223c2b [sgl-kernel/cpu] support w8a8 int8 model for arm cpu (#16045) Yibo Cai 2026-05-08 14:47:06 +08:00
47e9ec11ad [NPU] [DOC] fix ascend_npu_support_new_models TOC (#24658) amote-i 2026-05-08 14:07:00 +08:00
e1bc001872 fix(mimo_v2): auto-disable multimodal when vision/audio configs are absent (#24652) JoyFuture 2026-05-08 13:40:08 +08:00
7deed98e1b [fix] /pause_generation and /continue_generation wrong for --tokenizer-worker-num > 1 (#24462) maocheng23 2026-05-07 21:32:21 -07:00
2afb450501 [diffusion] optimize: optimize frame returns path (#24616) Mick 2026-05-08 12:10:09 +08:00
cdf5771f91 [MUSA][17/N] ci: Add MUSA diffusion, sgl-kernel tests, and CI workflow support (#20672) johnnycxm 2026-05-08 11:45:21 +08:00
15e6572f21 [MUSA][18/N] Add MUSA-optimized kernel implementations for hot ops (#23255) Joey 2026-05-08 11:38:33 +08:00