558 Commits

Author SHA1 Message Date
Liangsheng Yin
6cc2eee50d [misc] CI hygiene: enforce __main__ entry, drop silent-skipped tests, fix rerun-test protoc (#23305) 2026-04-20 21:16:24 -07:00
Cheng Wan
ebcc2b3eec ci: run weekly est_time update on Monday using p90 of last 15 runs (#23120)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:39:27 -07:00
Baizhou Zhang
6ecd6f84db [CI] Add per-job uv venv isolation and upgrade CI version to Cuda 13 (#23119)
Co-authored-by: Kangyan Zhou <zky314343421@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Alison Shao <a.shao@wustl.edu>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-04-19 05:32:36 -07:00
Mick
5de89ea942 [diffusion] CI: fix auto-partition (#23076) 2026-04-17 22:37:24 +08:00
YC Yen-Ching Tseng
f399997d2f [AMD] mirror nightly images to local registry and prefer LAN pulls (#23073)
Co-authored-by: bingxche <bingxche@amd.com>
2026-04-17 19:49:26 +08:00
Alex Nails
43eb66028f ci: install rust toolchain in ci_install_dependency.sh (#23017) 2026-04-16 23:18:22 -07:00
Bingxu Chen
7ac337df94 [AMD] CI Job Monitor: fix queue time, utilization, and summary metrics (#22274)
Co-authored-by: bingxche <binxche@amd.com>
2026-04-16 22:03:37 -07:00
ishandhanani
761259448d ci: re-enable fp8 nightly benchmark configs (#22910)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:57:49 -07:00
ishandhanani
2b0f349927 ci: clarify srt-slurm issue filing for incompatible flag combos (#22903)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:02:26 -07:00
ishandhanani
9497001b0c ci: add issue filing and suspect PR identification to log analyzer (#22899)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:27:14 -07:00
ishandhanani
f61c332cba ci: log analyzer (#22859) 2026-04-15 14:10:00 -07:00
Mick
80718492dd [diffusion] CI: reset thresholds (#22854) 2026-04-15 21:11:00 +08:00
Mick
e95c2e73bd [diffusion] CI: refactor diffusion ci and reduce redundancy (#22810) 2026-04-15 10:12:29 +08:00
Mick
c5e95080d2 [diffusion] model: support Ltx 2.3 two stage ti2v (#22667) 2026-04-14 22:10:08 +08:00
Baizhou Zhang
8fe9bbffb6 [CI] Reinstall flashinfer-jit-cache on CUDA version mismatch (#22741)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 23:04:23 -07:00
Jia Guo
bc16130a17 ci: skip full rerun when sgl-kernel wheel already built (#22534)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 20:32:55 -07:00
Baizhou Zhang
b441317aa4 Revert "Upgrade CI default CUDA version from 12.9 to 13.0" (#22727) 2026-04-13 14:39:24 -07:00
Mick
d524f110ac [diffusion] refactor: streamline denoising stages (#22633) 2026-04-13 13:34:37 +08:00
Alison Shao
3f4fbc165d Upgrade CI default CUDA version from 12.9 to 13.0 (#21441) 2026-04-12 21:48:40 -07:00
Mohammad Miadh Angkad
701a0e0c25 [CI/Docker] Clean up redundant flashinfer cubin downloads (#22491) 2026-04-12 12:30:41 -07:00
Prozac614
45472d70cc [diffusion] CI: dynamic load-balanced partitioning for diffusion CI (#15528)
Co-authored-by: daiweitao <dwti614707404@163.com>
Co-authored-by: SGLang CI <ci@sglang.ai>
2026-04-12 13:02:43 +08:00
Alison Shao
f21d23a211 ci: use local NVIDIA wheels to avoid re-downloading ~2GB every CI run (#22602)
Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
2026-04-11 21:32:51 -07:00
Alison Shao
870a21bf39 [CI] Remove Slack bot from CI failure monitor (#21581)
Co-authored-by: Alison Shao <alison.shao@Mac.attlocal.net>
2026-04-11 20:34:48 -07:00
Bingxu Chen
213027951a [AMD] Upgrade Aiter (#22264) 2026-04-10 18:40:43 -07:00
Cheng Wan
b5e4ae7b1a fix: match est_time updates by backend, not just suite (#22563) 2026-04-10 17:54:50 -07:00
Cheng Wan
0011d2aec0 fix: track est_time per suite instead of per backend (#22557) 2026-04-10 16:58:40 -07:00
Sahithi Chigurupati
451320596f [CI] Add GB200 nightly perf regression pipeline (#22461) 2026-04-10 15:12:24 -07:00
Cheng Wan
3f39b3d811 feat: add weekly workflow to update CI test est_time values (#22545)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 15:03:37 -07:00
Ratish P
cf5ad12612 [diffusion][CI]: route multimodal component accuracy through run_suite (#21960) 2026-04-10 23:06:03 +08:00
tfhddd
c431b11d8b [CI] Use UV to improve pip install speed (#22029) 2026-04-09 09:18:32 +08:00
Alison Shao
e41647f52b [CI] Add pre-commit hook to validate test/registered/ files have CI registry (#22308)
Co-authored-by: Alison Shao <alison.shao@Mac.attlocal.net>
Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
2026-04-08 15:59:15 -07:00
Rain Jiang
1a8eb890f6 Kernels community fa3 (#20796) 2026-04-07 12:48:44 -07:00
Kangyan-Zhou
596c34ee04 Update ci_auto_bisect.py to have streak 1 so that all failures will b… (#22161)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 10:39:19 -07:00
Kangyan-Zhou
edee9ae929 Update ci_auto_bisect.py to use correct model (#22142) 2026-04-04 23:57:52 -07:00
Kangyan-Zhou
8cbeacd783 feat: CI auto-bisect workflow for automated regression analysis (#22119)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 18:58:18 -07:00
Mick
efee62efa6 [diffusion] CI: improve diffusion comparison benchmark setting for realistic perf and auto-discover ut (#22086)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-04 23:20:37 +08:00
Xiaoyu Zhang
da25b471e3 Align diffusion nightly presets and broaden skill discovery (#22099) 2026-04-04 21:43:52 +08:00
Prozac614
db3d4f4b76 [diffusion] model: support two stage pipeline of LTX-2 (#20707)
Co-authored-by: daiweitao <dwti614707404@163.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: GMI Xiao Jin <xiao.j@gmicloud.ai>
2026-04-04 09:37:28 +08:00
Liangsheng Yin
5118295f7b [CI] Support CPU stage and auto-batch same-stage files in /rerun-test (#22081) 2026-04-03 15:56:54 -07:00
Mick
838f815e9f [diffusion] CI: temporarily disable accuracy ci (#22031) 2026-04-03 17:39:29 +08:00
Duyi-Wang
ac593fed90 [AMD][Dockerfile] Support build-arg AITER_COMMIT for rocm.Dockerfile (#21949) 2026-04-03 01:54:28 -07:00
monkeyLoveding
658a2813d8 [NPU] Update CI Dependency (#21578) 2026-04-03 16:22:11 +08:00
Liangsheng Yin
4cc970290d [CI] Fix duplicate job names that bypass branch protection (#22001) 2026-04-02 23:59:35 -07:00
Feng Su
8732b2e9c6 [CI] [Tracing] Add ci for tracing and fix bugs (#21740) 2026-04-02 10:50:50 -07:00
David Cheung
ed427e1299 Migrate all callers from /get_server_info to /server_info (#21463) 2026-04-01 21:17:50 -07:00
Prozac614
24997fe42c [diffusion] CI: add initial nvfp4 ci test for b200 (#21767)
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-04-02 11:31:08 +08:00
Shangming Cai
7004df6094 chore: bump mooncake version to 0.3.10.post1 (#21844) 2026-04-02 10:54:22 +08:00
Noa Neria
8d9145d97e Direct model loading from object storage with Runai Model Streamer (#17948)
Signed-off-by: Noa Neria <noa@run.ai>
2026-04-01 18:41:22 -07:00
Ratish P
4f5b55e379 [diffusion][CI]: Add individual component accuracy CI for diffusion models (#18709)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-04-01 21:51:36 +08:00
Douglas Yang
1b45d81e91 fix: only showing recent runners from ci failure analysis (#21015) 2026-03-31 20:18:17 -07:00