mscclpp/include
Binyang Li affca7d9bc Add NVLS based fallback algo (#507)
Add two NVLS-based fallback algorithms. allreduce9 uses NVLS with zero copy.
allreduce10 uses NVLS when the input must be copied to a scratch buffer first:
it copies the data in, performs the reduce operation, then copies the result
back to the result buffer.
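
The copy-in, reduce, copy-out flow described for allreduce10 can be sketched on the host with plain Python lists. This is only an illustration of the data movement, not mscclpp code: it does not model NVLS hardware, and the helper name `allreduce_via_scratch` is hypothetical.

```python
def allreduce_via_scratch(inputs):
    """Simulate a copy-in, reduce, copy-out allreduce on plain lists.

    `inputs` holds one equal-length list per rank.
    Hypothetical helper for illustration only; not part of mscclpp.
    """
    # Step 1: every rank copies its input into a scratch buffer.
    scratch = [list(buf) for buf in inputs]
    # Step 2: reduce element-wise across the scratch buffers (sum).
    reduced = [sum(column) for column in zip(*scratch)]
    # Step 3: copy the reduced result back into each rank's result buffer.
    return [list(reduced) for _ in inputs]


results = allreduce_via_scratch([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(results[0])  # [9.0, 12.0]
```

A zero-copy variant such as the one described for allreduce9 would skip steps 1 and 3 and reduce directly over the user buffers.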

Performance numbers for allreduce9:
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024           256     float     sum      -1     5.45    0.19    0.33      0     5.35    0.19    0.33      0
        2048           512     float     sum      -1     5.57    0.37    0.64      0     5.53    0.37    0.65      0
        4096          1024     float     sum      -1     5.80    0.71    1.24      0     5.78    0.71    1.24      0
        8192          2048     float     sum      -1     5.94    1.38    2.42      0     5.85    1.40    2.45      0
       16384          4096     float     sum      -1     6.40    2.56    4.48      0     6.27    2.61    4.57      0
       32768          8192     float     sum      -1     7.45    4.40    7.70      0     7.39    4.43    7.76      0
       65536         16384     float     sum      -1     8.03    8.17   14.29      0     8.32    7.88   13.79      0
      131072         32768     float     sum      -1     7.28   18.00   31.49      0     7.07   18.53   32.43      0
      262144         65536     float     sum      -1     7.72   33.95   59.41      0     7.59   34.56   60.48      0
      524288        131072     float     sum      -1     8.70   60.29  105.51      0     8.37   62.61  109.57      0
     1048576        262144     float     sum      -1    10.56   99.26  173.70      0    10.32  101.64  177.87      0
     2097152        524288     float     sum      -1    14.45  145.14  253.99      0    14.02  149.58  261.76      0
     4194304       1048576     float     sum      -1    22.83  183.73  321.52      0    23.03  182.14  318.75      0
     8388608       2097152     float     sum      -1    38.63  217.14  380.00      0    38.57  217.52  380.65      0
    16777216       4194304     float     sum      -1    70.03  239.58  419.27      0    69.96  239.80  419.66      0
    33554432       8388608     float     sum      -1    131.5  255.17  446.55      0    131.3  255.59  447.28      0
    67108864      16777216     float     sum      -1    255.8  262.37  459.15      0    255.4  262.75  459.82      0
   134217728      33554432     float     sum      -1    500.9  267.94  468.90      0    500.0  268.42  469.74      0
   268435456      67108864     float     sum      -1    989.0  271.41  474.97      0    988.9  271.45  475.05      0
   536870912     134217728     float     sum      -1   1967.4  272.88  477.54      0   1966.0  273.08  477.88      0
  1073741824     268435456     float     sum      -1   3908.5  274.72  480.77      0   3904.6  274.99  481.24      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 218.734 
```
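
For readers unfamiliar with the column layout, here is a minimal sketch of how the `algbw` and `busbw` columns relate to `size` and `time`, assuming the output follows the nccl-tests convention for allreduce (`busbw = algbw * 2 * (n - 1) / n`). The rank count of 8 is an inference from the busbw/algbw ratio of about 1.75 in the table, not something stated in the commit, and the function name is hypothetical.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s, assuming the nccl-tests convention."""
    # Algorithm bandwidth: bytes processed per second, in units of 1e9 B/s.
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # Bus bandwidth: algbw scaled by the allreduce factor 2 * (n - 1) / n.
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Last out-of-place row of the allreduce9 table: 1 GiB in 3908.5 us.
# n_ranks=8 is an assumption inferred from the table, see above.
algbw, busbw = allreduce_bandwidth(1073741824, 3908.5, n_ranks=8)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

The result lands close to the reported 274.72 and 480.77 GB/s; small differences in the last digit are expected because the printed time is rounded.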

Performance numbers for allreduce10:
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        1024           256     float     sum      -1     5.60    0.18    0.32      0     5.52    0.19    0.32      0
        2048           512     float     sum      -1     5.79    0.35    0.62      0     5.64    0.36    0.64      0
        4096          1024     float     sum      -1     5.92    0.69    1.21      0     5.82    0.70    1.23      0
        8192          2048     float     sum      -1     6.03    1.36    2.38      0     5.95    1.38    2.41      0
       16384          4096     float     sum      -1     6.58    2.49    4.35      0     6.39    2.56    4.49      0
       32768          8192     float     sum      -1     7.54    4.34    7.60      0     7.41    4.42    7.74      0
       65536         16384     float     sum      -1     7.95    8.24   14.42      0     8.10    8.09   14.16      0
      131072         32768     float     sum      -1     9.56   13.72   24.00      0     9.47   13.84   24.23      0
      262144         65536     float     sum      -1    11.49   22.81   39.92      0    11.41   22.97   40.20      0
      524288        131072     float     sum      -1    14.19   36.94   64.64      0    13.88   37.76   66.09      0
     1048576        262144     float     sum      -1    19.10   54.89   96.06      0    18.98   55.24   96.67      0
     2097152        524288     float     sum      -1    31.12   67.38  117.91      0    31.34   66.92  117.10      0
     4194304       1048576     float     sum      -1    44.88   93.46  163.56      0    44.76   93.70  163.97      0
     8388608       2097152     float     sum      -1    63.23  132.68  232.18      0    62.53  134.14  234.75      0
    16777216       4194304     float     sum      -1    106.8  157.03  274.80      0    105.9  158.46  277.30      0
    33554432       8388608     float     sum      -1    172.2  194.91  341.09      0    172.0  195.05  341.35      0
    67108864      16777216     float     sum      -1    299.8  223.83  391.70      0    300.8  223.12  390.46      0
   134217728      33554432     float     sum      -1    553.1  242.66  424.66      0    553.8  242.38  424.16      0
   268435456      67108864     float     sum      -1   1056.1  254.18  444.82      0   1057.4  253.86  444.26      0
   536870912     134217728     float     sum      -1   2064.0  260.11  455.20      0   2063.8  260.14  455.25      0
  1073741824     268435456     float     sum      -1   4074.4  263.53  461.18      0   4065.8  264.09  462.16      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 169.799 
```
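
To put the two tables side by side, the sketch below computes how much longer allreduce10 takes than allreduce9 for two message sizes. The times are the out-of-place values copied from the tables above; the helper name `slowdown` is hypothetical.

```python
# Out-of-place times in microseconds: size in bytes -> (allreduce9, allreduce10).
times_us = {
    1048576: (10.56, 19.10),
    1073741824: (3908.5, 4074.4),
}


def slowdown(t9_us, t10_us):
    """Ratio of allreduce10 time to allreduce9 time."""
    return t10_us / t9_us


for size, (t9, t10) in times_us.items():
    print(f"{size:>12d} B: allreduce10 takes {slowdown(t9, t10):.2f}x the time of allreduce9")
```

At 1 MiB the ratio is about 1.81x, while at 1 GiB it shrinks to about 1.04x, so in these measurements the relative gap between the two algorithms narrows as the message grows.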

---------

Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2025-04-27 14:09:31 -07:00