Author | Commit | Message | Date
Qinghua Zhou | ec011f14ea | Add detection of torch.baseline and debug info | 2026-03-25 01:52:24 +00:00
Qinghua Zhou | 7e1cb7b8cf | Support cross-node CudaIPC | 2026-03-21 10:41:32 +00:00
Qinghua Zhou | 9ef1fb7cee | Run and pass the multinode test | 2026-03-18 17:08:22 +00:00
Qinghua Zhou | bdb30b56a5 | Broadcast UniqueId via TCP; detect whether torch comparison is possible | 2026-03-16 10:01:35 +00:00
Qinghua Zhou | f47e97659d | Update the benchmark to improve rank mapping, communicator creation, and backend selection | 2026-03-16 09:25:34 +00:00
Qinghua Zhou | d00713d3c2 | Add more real MoE workloads for alltoallv | 2026-03-02 12:51:21 +00:00
Qinghua Zhou | ee843d445f | Add test of real MoE workloads | 2026-02-25 12:39:48 +00:00
Qinghua Zhou | ae59eab6a2 | Add unified benchmarking function to test all_to_all_single of mscclpp and torch | 2026-02-24 07:17:17 +00:00
Qinghua Zhou | 715ecd91cf | Add baseline test of torch.distributed.all_to_all_single | 2026-02-24 06:51:10 +00:00
Qinghua Zhou | 98be0def08 | Use variable sizes in the performance test | 2026-02-24 06:29:46 +00:00
Qinghua Zhou | 6292b6ab33 | Report unidirectional bandwidth | 2026-02-24 06:02:33 +00:00
Qinghua Zhou | 21e3f1ebb3 | Get correct remote receive displacements for peers | 2026-02-23 14:22:30 +00:00
Qinghua Zhou | 7ba83e20dd | PyTorch-compatible all_to_all_single API using mscclpp kernels | 2026-02-23 09:51:51 +00:00