mscclpp/test
Binyang Li bb76d27553 all2all implementation (#609)
Implement single-node all2all via the MSCCL++ C++ API.
Performance of kernel 3:
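For readers unfamiliar with the collective, the data movement that all2all performs can be sketched as a host-side reference in plain C++. This is not the MSCCL++ API or the kernel from this commit; the function name `referenceAllToAll` and the equal-sized-chunk layout are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for all-to-all semantics (hypothetical helper, not MSCCL++ code).
// send[r] is rank r's send buffer, split into nRanks equal chunks.
// Chunk `dst` of rank `src` becomes chunk `src` of rank `dst`'s receive buffer.
std::vector<std::vector<int>> referenceAllToAll(
    const std::vector<std::vector<int>>& send) {
  const std::size_t nRanks = send.size();
  assert(nRanks > 0);
  const std::size_t total = send[0].size();
  assert(total % nRanks == 0);  // assumption: buffers divide evenly across ranks
  const std::size_t chunk = total / nRanks;

  std::vector<std::vector<int>> recv(nRanks, std::vector<int>(total));
  for (std::size_t src = 0; src < nRanks; ++src) {
    assert(send[src].size() == total);
    for (std::size_t dst = 0; dst < nRanks; ++dst) {
      for (std::size_t i = 0; i < chunk; ++i) {
        recv[dst][src * chunk + i] = send[src][dst * chunk + i];
      }
    }
  }
  return recv;
}
```

A reference like this could serve as the expected result when checking a GPU implementation, which is presumably what the `#wrong` column in the tables counts.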
```
#                                        in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.41   44.78   39.19      0
     2097152         65536                                     23.95   87.56   76.61      0
     4194304        131072                                     27.50  152.51  133.45      0
     8388608        262144                                     35.14  238.73  208.89      0
    16777216        524288                                     57.54  291.55  255.11      0
    33554432       1048576                                     109.7  305.81  267.59      0
    67108864       2097152                                     212.3  316.07  276.56      0
   134217728       4194304                                     410.9  326.64  285.81      0
   268435456       8388608                                     784.9  341.99  299.24      0
```
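The `algbw` and `busbw` columns can be reproduced from `size` and `time`. The sketch below assumes the usual convention for all-to-all benchmarks: algorithm bandwidth is bytes divided by time, and bus bandwidth scales it by (n-1)/n. The rank count n = 8 is not stated in the commit message; it is inferred from the ratio busbw/algbw ≈ 0.875 in the rows above. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Algorithm bandwidth in GB/s (1 GB = 1e9 bytes) from a size in bytes
// and a time in microseconds: bytes / (us * 1e-6 s) / 1e9 = bytes / (us * 1e3).
double algBwGBps(std::size_t bytes, double timeUs) {
  assert(timeUs > 0.0);
  return static_cast<double>(bytes) / (timeUs * 1e3);
}

// Bus bandwidth for all-to-all under the assumed (n-1)/n convention:
// each rank keeps 1/n of its data locally, so only (n-1)/n crosses the links.
double busBwGBps(double algBw, int nRanks) {
  assert(nRanks > 1);
  return algBw * (nRanks - 1) / nRanks;
}
```

For the first row (1048576 B in 23.41 us), `algBwGBps` gives roughly 44.8 GB/s and `busBwGBps(44.78, 8)` roughly 39.2 GB/s, matching the table up to rounding of the reported time.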

Kernel 2:
```
#                                        in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768                                     23.42   44.77   39.17      0
     2097152         65536                                     24.96   84.02   73.52      0
     4194304        131072                                     28.53  147.03  128.65      0
     8388608        262144                                     36.75  228.28  199.75      0
    16777216        524288                                     58.01  289.20  253.05      0
    33554432       1048576                                     110.4  303.83  265.85      0
    67108864       2097152                                     212.4  315.99  276.49      0
   134217728       4194304                                     407.8  329.12  287.98      0
   268435456       8388608                                     797.4  336.64  294.56      0
```

NCCL:
```
NCCL version 2.21.5+cuda12.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     8388608        524288      half    none      -1    38.70  216.75  189.66      0    39.25  213.72  187.00    N/A
    16777216       1048576      half    none      -1    71.39  234.99  205.62      0    68.41  245.25  214.60    N/A
    33554432       2097152      half    none      -1    119.7  280.22  245.20      0    119.8  280.17  245.15    N/A
    67108864       4194304      half    none      -1    211.9  316.66  277.08      0    212.7  315.53  276.09    N/A
   134217728       8388608      half    none      -1    408.4  328.61  287.53      0    393.8  340.87  298.26    N/A
   268435456      16777216      half    none      -1    761.6  352.47  308.41      0    763.3  351.70  307.73    N/A
   536870912      33554432      half    none      -1   1502.5  357.31  312.64      0   1467.3  365.89  320.16    N/A
```
2025-08-14 11:30:40 -07:00