Implement single-node all2all via the MSCCL++ C++ API
Performance with kernel 3:
```
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768    23.41   44.78   39.19       0
     2097152         65536    23.95   87.56   76.61       0
     4194304        131072    27.50  152.51  133.45       0
     8388608        262144    35.14  238.73  208.89       0
    16777216        524288    57.54  291.55  255.11       0
    33554432       1048576    109.7  305.81  267.59       0
    67108864       2097152    212.3  316.07  276.56       0
   134217728       4194304    410.9  326.64  285.81       0
   268435456       8388608    784.9  341.99  299.24       0
```
Performance with kernel 2:
```
#                                     in-place                       out-of-place
#       size         count     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)     (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576         32768    23.42   44.77   39.17       0
     2097152         65536    24.96   84.02   73.52       0
     4194304        131072    28.53  147.03  128.65       0
     8388608        262144    36.75  228.28  199.75       0
    16777216        524288    58.01  289.20  253.05       0
    33554432       1048576    110.4  303.83  265.85       0
    67108864       2097152    212.4  315.99  276.49       0
   134217728       4194304    407.8  329.12  287.98       0
   268435456       8388608    797.4  336.64  294.56       0
```
NCCL:
```
NCCL version 2.21.5+cuda12.4
#
#                                                         out-of-place                       in-place
#       size         count    type   redop  root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                           (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     8388608        524288    half    none    -1    38.70  216.75  189.66      0    39.25  213.72  187.00    N/A
    16777216       1048576    half    none    -1    71.39  234.99  205.62      0    68.41  245.25  214.60    N/A
    33554432       2097152    half    none    -1    119.7  280.22  245.20      0    119.8  280.17  245.15    N/A
    67108864       4194304    half    none    -1    211.9  316.66  277.08      0    212.7  315.53  276.09    N/A
   134217728       8388608    half    none    -1    408.4  328.61  287.53      0    393.8  340.87  298.26    N/A
   268435456      16777216    half    none    -1    761.6  352.47  308.41      0    763.3  351.70  307.73    N/A
   536870912      33554432    half    none    -1   1502.5  357.31  312.64      0   1467.3  365.89  320.16    N/A
```
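
For reference, the algbw and busbw columns in the tables above can be cross-checked with a small helper. This is only a sketch, not code from this change: the function names are hypothetical, and the rank count of 8 is an assumption inferred from the busbw/algbw ratio (about 0.875) in the reported numbers. It follows the nccl-tests convention for all-to-all, where busbw is algbw scaled by (n-1)/n.

```cpp
#include <cstddef>

// Hypothetical helper (not part of MSCCL++): algorithm bandwidth in GB/s.
// bytes / (microseconds * 1e3) == bytes per nanosecond == GB/s (1 GB = 1e9 B).
double algBwGBps(std::size_t sizeBytes, double timeUs) {
  return static_cast<double>(sizeBytes) / (timeUs * 1e3);
}

// Hypothetical helper: bus bandwidth for all-to-all.
// Each rank keeps 1/n of its buffer locally, so only (n-1)/n of the data
// crosses the interconnect. nRanks = 8 is an assumption for these tables.
double busBwGBps(double algBw, int nRanks) {
  return algBw * static_cast<double>(nRanks - 1) / static_cast<double>(nRanks);
}
```

With the first kernel 3 row (1048576 B in 23.41 us), `algBwGBps` gives roughly 44.8 GB/s and `busBwGBps(44.78, 8)` gives roughly 39.2 GB/s, in line with the reported 44.78 and 39.19 up to rounding of the printed time.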