Add a deviceSemaphore structure and implement a new NVLS-based algorithm
to show how to use these APIs. Current performance of the NVLS
non-zero-copy version:
```
#
#                                                   out-of-place                    in-place
#       size       count  type redop  root    time   algbw   busbw #wrong    time   algbw   busbw #wrong
#        (B)  (elements)                      (us)  (GB/s)  (GB/s)           (us)  (GB/s)  (GB/s)
        1024         512  half   sum    -1    6.10    0.17    0.29      0    5.65    0.18    0.32      0
        2048        1024  half   sum    -1    5.94    0.35    0.60      0    5.85    0.35    0.61      0
        4096        2048  half   sum    -1    6.11    0.67    1.17      0    5.97    0.69    1.20      0
        8192        4096  half   sum    -1    6.22    1.32    2.31      0    6.17    1.33    2.33      0
       16384        8192  half   sum    -1    6.68    2.45    4.29      0    6.52    2.51    4.39      0
       32768       16384  half   sum    -1    8.02    4.09    7.15      0    7.66    4.28    7.49      0
       65536       32768  half   sum    -1    8.09    8.10   14.18      0    7.91    8.29   14.51      0
      131072       65536  half   sum    -1    9.58   13.68   23.93      0    9.61   13.64   23.86      0
      262144      131072  half   sum    -1   12.60   20.81   36.42      0   12.28   21.35   37.37      0
      524288      262144  half   sum    -1   14.51   36.12   63.22      0   14.09   37.21   65.12      0
     1048576      524288  half   sum    -1   19.45   53.92   94.36      0   19.29   54.35   95.12      0
     2097152     1048576  half   sum    -1   31.00   67.66  118.40      0   30.80   68.08  119.14      0
     4194304     2097152  half   sum    -1   44.71   93.80  164.16      0   44.66   93.91  164.34      0
     8388608     4194304  half   sum    -1   62.96  133.24  233.17      0   62.49  134.24  234.91      0
    16777216     8388608  half   sum    -1   105.1  159.68  279.45      0   104.4  160.74  281.29      0
    33554432    16777216  half   sum    -1   169.9  197.55  345.71      0   169.8  197.64  345.87      0
    67108864    33554432  half   sum    -1   298.1  225.12  393.96      0   298.1  225.09  393.91      0
   134217728    67108864  half   sum    -1   552.9  242.77  424.84      0   553.7  242.39  424.18      0
   268435456   134217728  half   sum    -1  1055.8  254.24  444.91      0  1056.9  253.98  444.47      0
   536870912   268435456  half   sum    -1  2040.1  263.15  460.52      0  2045.1  262.52  459.40      0
  1073741824   536870912  half   sum    -1  3996.9  268.65  470.13      0  4007.7  267.92  468.86      0
```
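
For readers relating the `algbw` and `busbw` columns, here is a minimal sketch of how the two are usually derived. It assumes the table follows the nccl-tests convention for allreduce, where `busbw = algbw * 2 * (n - 1) / n`. The rank count `n = 8` is not stated above; it is inferred from the busbw/algbw ratio of about 1.75 in the table. The function names are hypothetical and not part of mscclpp.

```python
# Sketch under assumptions: nccl-tests allreduce convention, 8 ranks (inferred).

def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved per second, with 1 GB = 1e9 B."""
    return size_bytes / (time_us * 1e-6) / 1e9


def busbw_gbps(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for allreduce: algbw scaled by 2 * (n - 1) / n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Last out-of-place row of the table: 1073741824 B in 3996.9 us.
    alg = algbw_gbps(1073741824, 3996.9)
    bus = busbw_gbps(alg, n_ranks=8)  # n_ranks=8 is an assumption
    print(f"algbw ~ {alg:.2f} GB/s, busbw ~ {bus:.2f} GB/s")
```

The results match the reported 268.65 and 470.13 GB/s to within the rounding of the printed time.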