Commit Graph

  • 7e1cb7b8cf Support cross-node CudaIPC qinghuazhou/alltoallv_kernel_multinode Qinghua Zhou 2026-03-21 10:41:32 +00:00
  • dfab8b94ca Merge branch 'main' into copilot/remove-gtest-use-custom-framework copilot/remove-gtest-use-custom-framework Changho Hwang 2026-03-21 02:37:22 -07:00
  • 41c7ce48e5 WIP binyli/handle-fix Binyang Li 2026-03-20 04:59:29 +00:00
  • b7adec0e60 create sglang docker image rjsouza/sglang-tests empyreus 2026-03-19 20:03:00 +00:00
  • 5d18835417 Fix use-after-free for fabric allocation handle in GpuIpcMemHandle (#764) main Binyang Li 2026-03-19 11:52:09 -07:00
  • 2a4f270bc9 for accumulation Binyang Li 2026-03-19 16:11:11 +00:00
  • fc3f4b9755 tune #instances and remoce extra barriers binyli/ib-no-atomic-test Ubuntu 2026-03-19 00:41:33 +00:00
  • 0ceef09eda WIP Binyang Li 2026-03-18 22:35:07 +00:00
  • 02005322a7 Merge branch 'copilot/remove-gtest-use-custom-framework' into chhwang/fix-ib-no-atomic chhwang/fix-ib-no-atomic Changho Hwang 2026-03-18 14:04:20 -07:00
  • 79a014976d updates Changho Hwang 2026-03-18 20:30:18 +00:00
  • 6082648f80 fix for npkit Changho Hwang 2026-03-18 20:06:37 +00:00
  • bff76d5b85 Fix TearDown() handling and replace assert() in perf tests copilot-swe-agent[bot] 2026-03-18 19:44:11 +00:00
  • c38c3517fd attempting to gix az cli empyreus 2026-03-18 19:36:40 +00:00
  • 08092653b2 install pip systemwide empyreus 2026-03-18 19:10:56 +00:00
  • b7ede93f13 move from apt-get to pip empyreus 2026-03-18 18:55:07 +00:00
  • 4742dfef39 fix sudo issue empyreus 2026-03-18 18:30:09 +00:00
  • 343c3671ef fix sudo empyreus 2026-03-18 18:07:25 +00:00
  • 9ef1fb7cee Run pass the multinode test Qinghua Zhou 2026-03-18 17:08:22 +00:00
  • 275622159c update Changho Hwang 2026-03-18 02:32:21 +00:00
  • 47cdfc9c3b Match with message size of NCCL EP bench test qinghuazhou/alltoallv_kernel Qinghua Zhou 2026-03-18 02:30:08 +00:00
  • 2297a3deda updates Changho Hwang 2026-03-18 00:58:08 +00:00
  • ffa120f6b1 rework template empyreus 2026-03-17 21:58:01 +00:00
  • 9dd47b3c27 update the executor so we have message size range Ubuntu 2026-03-17 21:00:48 +00:00
  • c919b961d2 show scale in output Ubuntu 2026-03-17 20:43:32 +00:00
  • 51416d6a67 update scripts Ubuntu 2026-03-17 20:21:13 +00:00
  • 0f38ab592f add scripts Ubuntu 2026-03-17 20:06:15 +00:00
  • 5a65cc7aba debugging Changho Hwang 2026-03-17 20:00:34 +00:00
  • 8dc63faedd re-format output Ubuntu 2026-03-17 19:59:35 +00:00
  • 8686d81de5 testing Empyreus 2026-03-17 19:45:07 +00:00
  • 371dfb3cc3 fix pip Empyreus 2026-03-17 19:19:28 +00:00
  • a8edfb7cf9 WIP Binyang Li 2026-03-17 19:16:55 +00:00
  • 431234f0a4 inital pipeline test Empyreus 2026-03-17 18:45:42 +00:00
  • c777290271 update Binyang Li 2026-03-17 18:12:42 +00:00
  • c84c2ede20 update the number of instances Ubuntu 2026-03-17 17:39:29 +00:00
  • d66d7e4743 debugging Changho Hwang 2026-03-17 01:41:40 +00:00
  • a937ce4a8d debugging Changho Hwang 2026-03-16 20:35:46 +00:00
  • 2c4bab8359 fix Changho Hwang 2026-03-16 18:37:57 +00:00
  • 5f42426dc8 inital creation of test files Empyreus 2026-03-16 17:47:48 +00:00
  • bdb30b56a5 Broadcast UniqueId via TCP; Detect whether torch comparison is possible Qinghua Zhou 2026-03-16 10:01:35 +00:00
  • f47e97659d Update the benchmark to improve the rank mapping, communicator creation, backend selection Qinghua Zhou 2026-03-10 03:17:12 +00:00
  • 958858125f Add NCCL EP bench equivalent workloads Qinghua Zhou 2026-03-16 08:48:48 +00:00
  • 2cb8adff14 WIP Binyang Li 2026-03-14 05:59:32 +00:00
  • e508c6755e fix memory leak Binyang Li 2026-03-14 04:09:38 +00:00
  • 42d9845a69 update Ubuntu 2026-03-12 16:48:35 +00:00
  • e2a9692674 fix merge Changho Hwang 2026-03-11 21:04:45 +00:00
  • a38bd9dee2 Merge branch 'main' into copilot/remove-gtest-use-custom-framework Changho Hwang 2026-03-11 14:02:56 -07:00
  • 2a705f52e1 fix merge Changho Hwang 2026-03-11 20:38:54 +00:00
  • e2a5be467d debugging Changho Hwang 2026-03-11 02:40:50 +00:00
  • 757c0ecc6a debugging Changho Hwang 2026-03-11 01:00:12 +00:00
  • cf505d777a debugging Changho Hwang 2026-03-10 22:18:41 +00:00
  • 7a87c2c856 debugging Changho Hwang 2026-03-10 20:51:22 +00:00
  • 1cc8422af6 update Ubuntu 2026-03-10 19:01:37 +00:00
  • 6647338fb4 debugging Changho Hwang 2026-03-10 17:50:04 +00:00
  • 1071ddb050 Update the benchmark to improve the rank mapping, communicator creation, backend selection qinghuazhou/alltoallv_kernel_gb200_benchmark Qinghua Zhou 2026-03-10 03:17:12 +00:00
  • a9cf93863f fix Changho Hwang 2026-03-09 23:49:54 +00:00
  • e6595f1be5 Use -1 sentinel for MSCCLPP_IB_GID_INDEX default and improve comment binyli/unique-qp-and-gid-index Ubuntu 2026-03-09 23:31:36 +00:00
  • aa5d42fc3a Change MSCCLPP_IB_GID_INDEX default to 0 Ubuntu 2026-03-09 23:19:40 +00:00
  • 8c7298a357 Add missing env fields to Python binding Ubuntu 2026-03-09 23:15:40 +00:00
  • 57af391a2a update Ubuntu 2026-03-09 22:40:35 +00:00
  • ce9bada51b update Ubuntu 2026-03-09 22:38:08 +00:00
  • 2478553b22 Unique QP per channel and env-controlled GID index Ubuntu 2026-03-09 20:27:28 +00:00
  • a76dbe8587 update Ubuntu 2026-03-09 22:40:35 +00:00
  • 982b1f3f4e update Ubuntu 2026-03-09 22:38:08 +00:00
  • 3efb1fd0d3 update Ubuntu 2026-03-09 21:16:46 +00:00
  • 30777565ac Unique QP per channel and env-controlled GID index Ubuntu 2026-03-09 20:27:28 +00:00
  • 5d9f7612f9 debug Ubuntu 2026-03-09 20:05:46 +00:00
  • bf946ea51e Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms (#759) Binyang Li 2026-03-09 10:22:45 -07:00
  • 4892b4ebea fix Ubuntu 2026-03-08 23:36:43 +00:00
  • d6a6fa2ffa simplified Changho Hwang 2026-03-08 05:31:48 +00:00
  • ea1dd65126 fix Changho Hwang 2026-03-08 04:05:58 +00:00
  • bcb392ffdf updates Changho Hwang 2026-03-08 03:33:51 +00:00
  • 375bc13831 fix Changho Hwang 2026-03-07 02:53:54 +00:00
  • c40a233f55 fix Changho Hwang 2026-03-07 02:48:08 +00:00
  • e0c7ddb5ff fix Changho Hwang 2026-03-07 02:33:20 +00:00
  • 75ac8be225 fix Changho Hwang 2026-03-07 02:31:51 +00:00
  • 284d9139c9 Merge branch 'main' into copilot/remove-gtest-use-custom-framework Changho Hwang 2026-03-06 18:26:02 -08:00
  • c699b8a784 az pipeline refactoring Changho Hwang 2026-03-07 02:23:30 +00:00
  • 3751f0299b Fix NCCL fallback comm destroy and use latest NCCL release in CI (#760) Binyang Li 2026-03-06 16:33:35 -08:00
  • 00583da21b separate pipeline for codecov Changho Hwang 2026-03-06 21:31:04 +00:00
  • 60ff32c014 updates Changho Hwang 2026-03-06 19:40:34 +00:00
  • bbb9c10a1e Update Docker image Changho Hwang 2026-03-06 19:15:04 +00:00
  • e9247ae2cc Semaphore in Connection chhwang/sema-in-conn Changho Hwang 2026-03-06 19:07:43 +00:00
  • 6bbb0425a2 debug Ubuntu 2026-03-06 18:25:03 +00:00
  • 32c8f9a704 Change persist MemoryChannel objects as class member to prevent dangling device pointers qinghuazhou/alltoallv_kernel_gb200 Qinghua Zhou 2026-03-06 03:58:05 +00:00
  • 0bebabc998 Fix cross-node CudaIpc for GB200 NVL: graceful IPC fallback and IMEX diagnostics Qinghua Zhou 2026-03-06 01:32:46 +00:00
  • 7ce841bed0 Updates Changho Hwang 2026-03-05 23:28:39 +00:00
  • 448ceb66f6 updates Changho Hwang 2026-03-05 22:59:33 +00:00
  • 3d3e272d3b Use cudaMalloc instead of GpuBuffer for communication buffers in alltoallv_test Qinghua Zhou 2026-03-05 14:58:44 +00:00
  • 237302258d PosixFx path falls through to RuntimeIpc when unix socket is unreachable (cross-node) Qinghua Zhou 2026-03-05 14:34:42 +00:00
  • cba13f83fc Exchange recv buffer with all peers and connect via CudaIpc in alltoallv_test Qinghua Zhou 2026-03-05 14:07:37 +00:00
  • 872bc433a9 Determine nRanksPerNode and localRank using hostname matching in mscclpp-test Qinghua Zhou 2026-03-05 13:33:24 +00:00
  • 82ed577a09 Force cudaIpc connection for memoryChannels on gb200 for nvlink-connected peers Qinghua Zhou 2026-03-05 10:37:35 +00:00
  • 3b56b08bcb data direct Changho Hwang 2026-03-04 23:36:39 +00:00
  • 03fc782c42 wip caiorocha/fix_channel_order Caio Rocha 2026-03-04 18:53:21 +00:00
  • acfcca7f87 Support hybrid connections for single and multi node Qinghua Zhou 2026-03-04 15:20:15 +00:00
  • f4b8574a1c Merge branch 'main' into copilot/remove-gtest-use-custom-framework Changho Hwang 2026-03-03 15:49:01 -08:00
  • d5743e2d6c Integrate with MoE training flow qinghuazhou/alltoallv_kernel_moe_integration Qinghua Zhou 2026-03-03 15:17:20 +00:00
  • 69565a2f32 Do threadInit/cudaSetDevice before other cuda calls (#757) Xingbo Wu 2026-03-02 23:53:59 +00:00
  • badd279bef fix lcov version chhwang/docker Changho Hwang 2026-03-02 11:52:16 -08:00
  • d00713d3c2 Add more real moe workloads for alltoallv Qinghua Zhou 2026-03-02 12:51:21 +00:00