Liangsheng Yin
6cc2eee50d
[misc] CI hygiene: enforce __main__ entry, drop silent-skipped tests, fix rerun-test protoc ( #23305 )
2026-04-20 21:16:24 -07:00
Cheng Wan
ebcc2b3eec
ci: run weekly est_time update on Monday using p90 of last 15 runs ( #23120 )
...
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:39:27 -07:00
Baizhou Zhang
6ecd6f84db
[CI] Add per-job uv venv isolation and upgrade CI version to Cuda 13 ( #23119 )
...
Co-authored-by: Kangyan Zhou <zky314343421@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Alison Shao <a.shao@wustl.edu>
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-04-19 05:32:36 -07:00
Mick
5de89ea942
[diffusion] CI: fix auto-partition ( #23076 )
2026-04-17 22:37:24 +08:00
YC Yen-Ching Tseng
f399997d2f
[AMD] mirror nightly images to local registry and prefer LAN pulls ( #23073 )
...
Co-authored-by: bingxche <bingxche@amd.com>
2026-04-17 19:49:26 +08:00
Alex Nails
43eb66028f
ci: install rust toolchain in ci_install_dependency.sh ( #23017 )
2026-04-16 23:18:22 -07:00
Bingxu Chen
7ac337df94
[AMD] CI Job Monitor: fix queue time, utilization, and summary metrics ( #22274 )
...
Co-authored-by: bingxche <binxche@amd.com>
2026-04-16 22:03:37 -07:00
ishandhanani
761259448d
ci: re-enable fp8 nightly benchmark configs ( #22910 )
...
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:57:49 -07:00
ishandhanani
2b0f349927
ci: clarify srt-slurm issue filing for incompatible flag combos ( #22903 )
...
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:02:26 -07:00
ishandhanani
9497001b0c
ci: add issue filing and suspect PR identification to log analyzer ( #22899 )
...
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:27:14 -07:00
ishandhanani
f61c332cba
ci: log analyzer ( #22859 )
2026-04-15 14:10:00 -07:00
Mick
80718492dd
[diffusion] CI: reset thresholds ( #22854 )
2026-04-15 21:11:00 +08:00
Mick
e95c2e73bd
[diffusion] CI: refactor diffusion ci and reduce redundancy ( #22810 )
2026-04-15 10:12:29 +08:00
Mick
c5e95080d2
[diffusion] model: support Ltx 2.3 two stage ti2v ( #22667 )
2026-04-14 22:10:08 +08:00
Baizhou Zhang
8fe9bbffb6
[CI] Reinstall flashinfer-jit-cache on CUDA version mismatch ( #22741 )
...
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 23:04:23 -07:00
Jia Guo
bc16130a17
ci: skip full rerun when sgl-kernel wheel already built ( #22534 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 20:32:55 -07:00
Baizhou Zhang
b441317aa4
Revert "Upgrade CI default CUDA version from 12.9 to 13.0" ( #22727 )
2026-04-13 14:39:24 -07:00
Mick
d524f110ac
[diffusion] refactor: streamline denoising stages ( #22633 )
2026-04-13 13:34:37 +08:00
Alison Shao
3f4fbc165d
Upgrade CI default CUDA version from 12.9 to 13.0 ( #21441 )
2026-04-12 21:48:40 -07:00
Mohammad Miadh Angkad
701a0e0c25
[CI/Docker] Clean up redundant flashinfer cubin downloads ( #22491 )
2026-04-12 12:30:41 -07:00
Prozac614
45472d70cc
[diffusion] CI: dynamic load-balanced partitioning for diffusion CI ( #15528 )
...
Co-authored-by: daiweitao <dwti614707404@163.com>
Co-authored-by: SGLang CI <ci@sglang.ai>
2026-04-12 13:02:43 +08:00
Alison Shao
f21d23a211
ci: use local NVIDIA wheels to avoid re-downloading ~2GB every CI run ( #22602 )
...
Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
2026-04-11 21:32:51 -07:00
Bingxu Chen
213027951a
[AMD] Upgrade Aiter ( #22264 )
2026-04-10 18:40:43 -07:00
Cheng Wan
b5e4ae7b1a
fix: match est_time updates by backend, not just suite ( #22563 )
2026-04-10 17:54:50 -07:00
Cheng Wan
0011d2aec0
fix: track est_time per suite instead of per backend ( #22557 )
2026-04-10 16:58:40 -07:00
Sahithi Chigurupati
451320596f
[CI] Add GB200 nightly perf regression pipeline ( #22461 )
2026-04-10 15:12:24 -07:00
Cheng Wan
3f39b3d811
feat: add weekly workflow to update CI test est_time values ( #22545 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 15:03:37 -07:00
Ratish P
cf5ad12612
[diffusion][CI]: route multimodal component accuracy through run_suite ( #21960 )
2026-04-10 23:06:03 +08:00
tfhddd
c431b11d8b
[CI] Use UV to improve pip install speed ( #22029 )
2026-04-09 09:18:32 +08:00
Alison Shao
e41647f52b
[CI] Add pre-commit hook to validate test/registered/ files have CI registry ( #22308 )
...
Co-authored-by: Alison Shao <alison.shao@Mac.attlocal.net>
Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
2026-04-08 15:59:15 -07:00
Rain Jiang
1a8eb890f6
Kernels community fa3 ( #20796 )
2026-04-07 12:48:44 -07:00
Mick
efee62efa6
[diffusion] CI: improve diffusion comparison benchmark setting for realistic perf and auto-discover ut ( #22086 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-04 23:20:37 +08:00
Xiaoyu Zhang
da25b471e3
Align diffusion nightly presets and broaden skill discovery ( #22099 )
2026-04-04 21:43:52 +08:00
Prozac614
db3d4f4b76
[diffusion] model: support two stage pipeline of LTX-2 ( #20707 )
...
Co-authored-by: daiweitao <dwti614707404@163.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: GMI Xiao Jin <xiao.j@gmicloud.ai>
2026-04-04 09:37:28 +08:00
Liangsheng Yin
5118295f7b
[CI] Support CPU stage and auto-batch same-stage files in /rerun-test ( #22081 )
2026-04-03 15:56:54 -07:00
Mick
838f815e9f
[diffusion] CI: temporarily disable accuracy ci ( #22031 )
2026-04-03 17:39:29 +08:00
Duyi-Wang
ac593fed90
[AMD][Dockerfile] Support build-arg AITER_COMMIT for rocm.Dockerfile ( #21949 )
2026-04-03 01:54:28 -07:00
monkeyLoveding
658a2813d8
[NPU] Update CI Dependency ( #21578 )
2026-04-03 16:22:11 +08:00
Liangsheng Yin
4cc970290d
[CI] Fix duplicate job names that bypass branch protection ( #22001 )
2026-04-02 23:59:35 -07:00
Feng Su
8732b2e9c6
[CI] [Tracing] Add ci for tracing and fix bugs ( #21740 )
2026-04-02 10:50:50 -07:00
Prozac614
24997fe42c
[diffusion] CI: add initial nvfp4 ci test for b200 ( #21767 )
...
Co-authored-by: Mick <mickjagger19@icloud.com>
2026-04-02 11:31:08 +08:00
Shangming Cai
7004df6094
chore: bump mooncake version to 0.3.10.post1 ( #21844 )
2026-04-02 10:54:22 +08:00
Noa Neria
8d9145d97e
Direct model loading from object storage with Runai Model Streamer ( #17948 )
...
Signed-off-by: Noa Neria <noa@run.ai>
2026-04-01 18:41:22 -07:00
Ratish P
4f5b55e379
[diffusion][CI]: Add individual component accuracy CI for diffusion models ( #18709 )
...
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2026-04-01 21:51:36 +08:00
Liangsheng Yin
53d8aa23ae
Cache nvidia wheels locally to skip repeated 830 MB downloads in CI ( #21778 )
2026-03-31 16:06:09 -07:00
Ke Bao
acd37d8701
[CI] Fix rerun-test suite detection to skip commented registrations ( #21753 )
2026-03-31 18:00:53 +08:00
Ke Bao
2456889f98
Rename rerun-ut to rerun-test ( #21747 )
2026-03-31 17:31:55 +08:00
Baizhou Zhang
d52757fe97
[CI]Remove msgm-en and mmlu tests which cause timeout ( #21733 )
2026-03-31 01:10:05 -07:00
Alison Shao
3650bfb199
Remove flashinfer wheel cache cleanup that deletes other versions ( #21711 )
...
Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local>
2026-03-30 16:47:04 -07:00
Mick
db5d9eb8ce
[diffusion] CI: fix dashboard chart (nightly) display issues ( #21653 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-30 12:02:01 +08:00