200 Commits

Author SHA1 Message Date
Alison Shao
f9c3def7fe Fix CI: add flashinfer --download-cubin to install dependencies (#18887)
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
2026-02-16 13:50:10 -08:00
Douglas Yang
f1efb46bdd fix: adding performance logging for nightly diffusion (#18023) 2026-02-16 14:09:00 +08:00
SoluMilken
07a24f1a38 update pre-commit config (#18860) 2026-02-16 00:18:31 +08:00
Kangyan-Zhou
eccf875d49 [CI] Revive 8-GPU trace upload in nightly test workflow (#18820)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 08:37:08 +08:00
Mohammad Miadh Angkad
1be41e9036 [FlashInfer] Bump FlashInfer version from 0.6.2 to 0.6.3 (#18448) 2026-02-14 07:43:33 +08:00
Ke Bao
a6c4b52ac5 Cleanup unused rerun stages (#18788) 2026-02-13 17:44:42 +08:00
Kangyan-Zhou
1b8f68af57 Fix B200 installation issue (#18725) 2026-02-12 22:06:23 +08:00
Alison Shao
f20b1703ce [CI] Fix torchaudio/torchvision CUDA version mismatch (#18211) 2026-02-11 23:47:32 -08:00
YC Tseng
20554a0a4f [AMD] rocm 7.2 image release, PR test, Nightly Test (#17799)
Co-authored-by: Alan Kao <akao@amd.com>
Co-authored-by: bingxche <Bingxu.Chen@amd.com>
Co-authored-by: Michael <13900043+michaelzhang-ai@users.noreply.github.com>
2026-02-11 21:29:25 -08:00
Alison Shao
7eaf866846 [CI] Install python3-dev for Triton JIT compilation on fresh runners (#18644) 2026-02-11 16:28:57 -08:00
Alison Shao
dcc63dc545 [CI] Guard python3 call in install script for fresh runners (#18609) 2026-02-12 00:05:29 +08:00
Bingxu Chen
316f9cbb35 [AMD] add amd ci monitor (#17476)
Co-authored-by: michaelzhang-ai <michaelzhang-ai@users.noreply.github.com>
Co-authored-by: YC Tseng <yctseng@amd.com>
2026-02-09 09:04:54 -08:00
YC Tseng
28717e3d28 [AMD] CI - Fix AMD daily image release and install dependency (#18452)
Co-authored-by: Bingxu Chen <bingxche@amd.com>
2026-02-08 22:20:09 -08:00
Bingxu Chen
3f3c201243 [AMD] Update aiter to v0.1.10.post2 (#18423)
Co-authored-by: kkHuang-amd <wunhuang@amd.com>
Co-authored-by: YC Tseng <yctseng@amd.com>
2026-02-08 22:08:24 -08:00
Shangming Cai
52401bec1d chore: bump mooncake version to 0.3.9 (#18316)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-02-07 17:30:01 +08:00
Alison Shao
bedade1ef0 Merge stage-c-test-large-4-gpu suites into partitioned suites (#18325) 2026-02-06 15:32:33 -08:00
Zhaoyi Li
8e933e1914 AMD PD/D PR ci (#17183)
Co-authored-by: YC Tseng <yctseng@amd.com>
Co-authored-by: Bingxu Chen <bingxche@amd.com>
Co-authored-by: bingxche <Bingxu.Chen@amd.com>
2026-02-02 23:29:14 -08:00
sunxxuns
47592a23c7 [CI] Fix AMD CI by inlining dummy_grok config (#18044)
Co-authored-by: root <root@mi300x8-005.atl1.do.cpe.ice.amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-01 00:20:57 -08:00
Kangyan-Zhou
e5ac6229e1 Fix installation script for H200 runners (#18050) 2026-01-31 23:30:51 -08:00
Alison Shao
a0bae4c343 Migrate 4-GPU/8-GPU workflow jobs to stage-c and add CI registry decorators (#17299) 2026-01-31 22:37:22 -08:00
Kangyan-Zhou
2cd2c3118d Add concurrency tracking to runner utilization report (#17963)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 17:31:55 -08:00
Alison Shao
1f75c2af4d Fix /tag-and-rerun-ci to do full rerun when PR has sgl-kernel changes (#17729) 2026-01-29 12:54:30 -08:00
Kangyan-Zhou
c0b4dd68a2 Add a performance dashboard server and frontend for nightly CUDA tests (#17725) 2026-01-27 22:22:33 -08:00
YC Tseng
52bca42870 [AMD] CI - enable deepseekv3.2 on MI325-8gpu and merge perf/accuracy test suites into stage-b suites (#17633)
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
2026-01-27 18:54:36 -08:00
Hubert Lu
93423ff780 [AMD] Deprecate ROCm 6.3 artifacts and standardize gfx942 on ROCm 7 (#17785) 2026-01-27 15:58:49 -08:00
monkeyLoveding
d578b41bad [NPU] Adapt cann 8.5: use sfa and lightning indexer op from cann and CI update (#17615)
Co-authored-by: Kelon <kelonlu@163.com>
2026-01-27 19:03:53 +08:00
Makcum888e
bba6e38ff8 [NPU] Split pyproject npu from pyproject other (#17641) 2026-01-26 09:45:44 -08:00
shaharmor98
f6f1b6d000 Bump FI version (#17700)
Signed-off-by: Shahar Mor <smor@nvidia.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
2026-01-26 16:50:06 +08:00
Alison Shao
7b22b8ff8a Fix sgl-kernel install: fail instead of PyPI fallback when artifacts missing (#17728) 2026-01-26 11:46:49 +08:00
Kangyan-Zhou
344eeaee90 Upload nightly test metrics to GH artifacts (#17696) 2026-01-25 14:35:14 -08:00
Makcum888e
d1042e0d62 [Refactore] [CI] Remove redundant CI test runs step 2 (#17584) 2026-01-24 23:39:48 -08:00
Alison Shao
b23470e95a Fix CI install failure when rerunning tests via workflow_dispatch (#17612) 2026-01-23 00:04:16 -08:00
YC Tseng
04a10c9bc2 [AMD] CI - migrate perf test and fix stage-b-test-1-gpu-amd (#17340)
Co-authored-by: Bingxu Chen <bingxche@amd.com>
Co-authored-by: bingxche <Bingxu.Chen@amd.com>
Co-authored-by: michaelzhang-ai <michaelzhang.ai@users.noreply.github.com>
2026-01-22 18:45:05 -08:00
Michael
a3addd6203 [AMD] Add DeepSeek-V3.2 and VLMs model in nightly tests (#17179)
Co-authored-by: michaelzhang-ai <michaelzhang-ai@users.noreply.github.com>
Co-authored-by: YC Tseng <yctseng@amd.com>
Co-authored-by: Bingxu Chen <bingxche@amd.com>
2026-01-19 20:31:56 -08:00
Alison Shao
fb88fb672e fix(ci): rate limit and permission errors in trace publishing (#17238) 2026-01-18 23:20:22 -08:00
Alison Shao
7edb06158e Add runner utilization report workflow (#17234) 2026-01-17 19:28:05 -08:00
fzyzcjy
a7b5f75d88 Support integration tests with Redis binary (#17045) 2026-01-17 11:59:04 +08:00
Alison Shao
b4fce9955a Add CI Coverage Overview workflow with detailed test listings (#16842) 2026-01-16 09:42:50 -08:00
Baizhou Zhang
a04675892e Update flashinfer to 0.6.1 (#15551) 2026-01-17 00:48:30 +08:00
YC Tseng
968c4f55b1 [AMD] Enable DeepseekV3.2 test for AMD CI (#16934) 2026-01-15 21:58:46 -08:00
Hudson Xing
21ee597e4a ci: enable offline mode when local cache is complete to avoid HF Hub … (#16121) 2026-01-15 20:15:33 -08:00
Alison Shao
146b5fcc84 [CI] Reorganize stage-b 1-GPU tests for 5090 compatibility (#16826) 2026-01-15 15:23:35 -08:00
Bingxu Chen
98096b5e02 [AMD CI] migrate and re-enable CI tests to new CI registry (#16949)
Co-authored-by: yctseng0211 <yctseng@amd.com>
2026-01-14 21:25:25 -08:00
Alison Shao
b880607108 Add 5090 dry run stage to PR test workflow (#17022) 2026-01-13 14:12:33 -08:00
James
ae0baefb94 [NPU] upgrade npu mf_apater plugin (#15853) 2026-01-13 09:02:10 +08:00
Alison Shao
17cb3c8e49 Enable /rerun-stage workflow URL lookup for fork PRs (#16851) 2026-01-11 23:05:37 +08:00
Alison Shao
9c64a15ad4 feat: add workflow run URL to /rerun-stage comment (#16825) 2026-01-10 10:41:20 +08:00
Alison Shao
ef35d8fe4e Migrate VLM tests and remove unit-test-backend-1-gpu job (#16679) 2026-01-09 15:24:26 -08:00
Shangming Cai
0c4e155a3c chore: bump mooncake version to 0.3.8.post1 (#16792)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2026-01-09 18:42:27 +08:00
YC Tseng
ccd0fb3291 [AMD] Change AITER package name (#16721) 2026-01-08 21:17:20 -08:00