73 Commits

Author SHA1 Message Date
Yuhao Yang
2f06867128 Optimize MHC pipeline: DeepGemm, fused norm, fused hc_head (#24775)
Co-authored-by: Cheng Wan <chwan@rice.edu>
Co-authored-by: Chunan Zeng <zcnrex@gmail.com>
2026-05-10 19:03:37 +08:00
Baizhou Zhang
ef5e9f8aba [DSV4] Cherry pick missing commits from deepseek_v4 branch and enhance tests (#24793)
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: yueming-yuan <yym022502@gmail.com>
2026-05-09 04:15:37 -07:00
Liangsheng Yin
ba625d5290 slash command rerun UX: emoji semantics + result writeback (#24802) 2026-05-09 03:19:24 -07:00
Alison Shao
aefd8e257f Re-land #23109: rebase-required mode + fix for grep-no-match abort (#24180) 2026-05-08 15:28:57 -07:00
Alison Shao
094b90b1ec ci: drop 1-gpu-h100-h200 shared label (#24495) 2026-05-06 01:02:31 -07:00
Kangyan-Zhou
52b4609789 [Docker] Prep for torch 2.11: cu129 fix, image validator, dep cleanup (#23593)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2026-05-04 00:37:55 -07:00
Liangsheng Yin
53df43d0a3 rerun-test: route deepep h200 suite to deepep runner (#24325) 2026-05-03 15:57:53 -07:00
Mick
5925572c95 [diffusion] CI: switch CI data references to sgl-project/ci-data (#24299) 2026-05-03 23:05:12 +08:00
Mick
b7d4647568 [diffusion] CI: change ground truth repo (#24219) 2026-05-01 21:25:40 -07:00
Alec
9d95783603 Add Docker image provenance metadata (#24090)
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
2026-04-30 21:40:42 -07:00
Alison Shao
694ef516cb Revert "[ci] split stage-c-test-4-gpu-b200 to enable a low-disk runner pool" (#24163)
Co-authored-by: Alison Shao <alisonshao@radixark.ai>
2026-04-30 15:57:19 -07:00
Alison Shao
7bb7f6049a ci: add per-host utilization view to runner-utilization report (#24102) 2026-04-30 10:05:16 -07:00
Mick
0b1fbdba15 [diffusion] CI: change ground truth upload path and improve publish script (#24120) 2026-04-30 12:26:10 +08:00
Alex Nails
c3ab5bec7d ci: consolidate rust + protoc install across workflows (#23700)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:39:02 -07:00
Kangyan-Zhou
d4040e7010 [CI] Broaden stage-b-test-1-gpu-large runner pool to H100 + H200 (#24080)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:18:10 -07:00
shuwenn
03147f66b8 ci: add /rerun-group to rerun all registered tests in a group (#24023)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:24:16 -07:00
Mick
144038fbae [diffusion] chore: change default seed to 42 (#23836) 2026-04-28 20:39:23 +08:00
Mick
c0166355ae [diffusion] CI: minor refactor CI (#23576) 2026-04-24 08:48:31 +08:00
Kangyan-Zhou
c689f774a4 [CI] /rerun-stage: fix workflow-run URL lookup for sgl-kernel PRs (#23510)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:48:38 -07:00
Kangyan-Zhou
14ac14287c [CI] /rerun-stage: auto-include wheel build when PR modifies sgl-kernel/ (#23492)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 11:28:06 -07:00
Jia Guo
286fba2073 ci: use rerun_failed_jobs for skipped workflows in /rerun-failed-ci (#23008)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 23:59:15 -07:00
Alison Shao
04b1caf75b ci: enable /rerun-test for multimodal gen PR tests (#22828) 2026-04-21 21:34:14 -07:00
Yuhao Yang
f41f1a74a4 [diffusion] chore: support custom output folder name in GT generation workflow (#23422) 2026-04-22 11:18:21 +08:00
Kangyan-Zhou
77fd86f89e [ci] split stage-c-test-4-gpu-b200 to enable a low-disk runner pool (#23417)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 18:33:33 -07:00
Alison Shao
0e165ffbfc ci: enable /rerun-test for nightly test suites (#22830) 2026-04-21 18:28:10 -07:00
Liangsheng Yin
6cc2eee50d [misc] CI hygiene: enforce __main__ entry, drop silent-skipped tests, fix rerun-test protoc (#23305) 2026-04-20 21:16:24 -07:00
Mick
5de89ea942 [diffusion] CI: fix auto-partition (#23076) 2026-04-17 22:37:24 +08:00
Alex Nails
43eb66028f ci: install rust toolchain in ci_install_dependency.sh (#23017) 2026-04-16 23:18:22 -07:00
Bingxu Chen
7ac337df94 [AMD] CI Job Monitor: fix queue time, utilization, and summary metrics (#22274)
Co-authored-by: bingxche <binxche@amd.com>
2026-04-16 22:03:37 -07:00
Mick
80718492dd [diffusion] CI: reset thresholds (#22854) 2026-04-15 21:11:00 +08:00
Mick
e95c2e73bd [diffusion] CI: refactor diffusion ci and reduce redundancy (#22810) 2026-04-15 10:12:29 +08:00
Mick
c5e95080d2 [diffusion] model: support Ltx 2.3 two stage ti2v (#22667) 2026-04-14 22:10:08 +08:00
Jia Guo
bc16130a17 ci: skip full rerun when sgl-kernel wheel already built (#22534)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 20:32:55 -07:00
Mick
d524f110ac [diffusion] refactor: streamline denoising stages (#22633) 2026-04-13 13:34:37 +08:00
Prozac614
45472d70cc [diffusion] CI: dynamic load-balanced partitioning for diffusion CI (#15528)
Co-authored-by: daiweitao <dwti614707404@163.com>
Co-authored-by: SGLang CI <ci@sglang.ai>
2026-04-12 13:02:43 +08:00
Ratish P
cf5ad12612 [diffusion][CI]: route multimodal component accuracy through run_suite (#21960) 2026-04-10 23:06:03 +08:00
Mick
efee62efa6 [diffusion] CI: improve diffusion comparison benchmark setting for realistic perf and auto-discover ut (#22086)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-04 23:20:37 +08:00
Xiaoyu Zhang
da25b471e3 Align diffusion nightly presets and broaden skill discovery (#22099) 2026-04-04 21:43:52 +08:00
Prozac614
db3d4f4b76 [diffusion] model: support two stage pipeline of LTX-2 (#20707)
Co-authored-by: daiweitao <dwti614707404@163.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: GMI Xiao Jin <xiao.j@gmicloud.ai>
2026-04-04 09:37:28 +08:00
Liangsheng Yin
5118295f7b [CI] Support CPU stage and auto-batch same-stage files in /rerun-test (#22081) 2026-04-03 15:56:54 -07:00
Mick
838f815e9f [diffusion] CI: temporarily disable accuracy ci (#22031) 2026-04-03 17:39:29 +08:00
Prozac614
24997fe42c [diffusion] CI: add initial nvfp4 ci test for b200 (#21767)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-04-02 11:31:08 +08:00
Ratish P
4f5b55e379 [diffusion][CI]: Add individual component accuracy CI for diffusion models (#18709)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-04-01 21:51:36 +08:00
Ke Bao
acd37d8701 [CI] Fix rerun-test suite detection to skip commented registrations (#21753) 2026-03-31 18:00:53 +08:00
Ke Bao
2456889f98 Rename rerun-ut to rerun-test (#21747) 2026-03-31 17:31:55 +08:00
Mick
db5d9eb8ce [diffusion] CI: fix dashboard chart (nightly) display issues (#21653)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-30 12:02:01 +08:00
Bingxu Chen
6047d2c690 [AMD] Fix AMD CI monitor GitHub API rate limit exhaustion (#21527) 2026-03-27 02:55:56 -07:00
Liangsheng Yin
e1ee68d0fc Release mm features on session close and support multiple /rerun-ut specs (#21501) 2026-03-26 18:31:29 -07:00
Liangsheng Yin
9dc266adb4 Fix concurrent /rerun-ut posting duplicate workflow URLs (#21495) 2026-03-26 16:26:00 -07:00
Mick
238a4b8f8f [diffusion] CI: fix breaking import path in nightly (#21449) 2026-03-26 16:33:22 +08:00