mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-25 15:24:43 +00:00

Author	SHA1	Message	Date
Binyang Li	25435acf5d	Add new algos for GB200 (#747 ) - Add new algos (allreduce_rsag, allreduce_rsag_pipeline and allreduce_rsag_zero_copy) for GB200. - Add IB stub for non-IB env - Provides example for algorithm tunning with different nblocks/nthreads Perf for allreduce_rsag ``` # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1048576 262144 float sum -1 25.16 41.67 62.51 0 23.73 44.18 66.27 0 2097152 524288 float sum -1 26.06 80.47 120.71 0 25.31 82.86 124.29 0 4194304 1048576 float sum -1 31.09 134.93 202.39 0 30.75 136.39 204.58 0 8388608 2097152 float sum -1 45.52 184.29 276.43 0 45.13 185.87 278.80 0 16777216 4194304 float sum -1 75.73 221.53 332.30 0 75.51 222.18 333.27 0 33554432 8388608 float sum -1 137.25 244.48 366.72 0 137.22 244.54 366.81 0 67108864 16777216 float sum -1 271.34 247.32 370.99 0 270.86 247.76 371.65 0 134217728 33554432 float sum -1 534.25 251.22 376.84 0 534.43 251.14 376.71 0 # Out of bounds values : 0 OK # Avg bus bandwidth : 264.454 # # Collective test concluded: all_reduce_perf ``` perf for allreduce_rsag_pipeline ``` # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1048576 262144 float sum -1 61.57 17.03 25.55 0 61.51 17.05 25.57 0 2097152 524288 float sum -1 61.31 34.20 51.31 0 61.23 34.25 51.38 0 4194304 1048576 float sum -1 61.62 68.06 102.10 0 61.84 67.83 101.74 0 8388608 2097152 float sum -1 61.97 135.37 203.06 0 61.89 135.53 203.30 0 16777216 4194304 float sum -1 63.15 265.65 398.48 0 62.89 266.76 400.15 0 33554432 8388608 float sum -1 100.63 333.46 500.19 0 99.76 336.34 504.51 0 67108864 16777216 float sum -1 180.04 372.75 559.13 0 179.75 373.34 560.01 0 134217728 33554432 float sum -1 339.60 395.23 592.84 0 338.16 396.91 595.36 0 # Out of bounds values : 0 OK # Avg bus bandwidth : 304.665 # # Collective test concluded: all_reduce_perf ``` perf for allreduce_rsag_zero_copy ``` # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1048576 262144 float sum -1 14.99 69.93 104.90 0 14.44 72.61 108.92 0 2097152 524288 float sum -1 16.19 129.56 194.33 0 15.85 132.32 198.48 0 4194304 1048576 float sum -1 21.19 197.98 296.97 0 20.64 203.20 304.81 0 8388608 2097152 float sum -1 31.04 270.27 405.41 0 30.68 273.44 410.16 0 16777216 4194304 float sum -1 50.34 333.26 499.89 0 50.15 334.51 501.77 0 33554432 8388608 float sum -1 89.58 374.56 561.84 0 88.65 378.48 567.73 0 67108864 16777216 float sum -1 165.69 405.03 607.54 0 163.64 410.10 615.16 0 134217728 33554432 float sum -1 323.19 415.28 622.93 0 318.01 422.05 633.07 0 # Out of bounds values : 0 OK # Avg bus bandwidth : 414.619 # # Collective test concluded: all_reduce_perf ``` --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com> Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2026-02-24 16:43:23 -08:00
Binyang Li	bd68319e3e	Refactor algo selection logic and introduce symmetric_memory env (#741 ) This PR refactors the algorithm selection logic in MSCCL++ and introduces support for symmetric memory configuration through environment variables. 1. Algorithm Selection Refactoring Use separate class for algo selection. Could introduce more complex logic for algo selection based on message size, arch, if cuda graph is enabled and memory allocation method 2. Symmetric Memory Support Introduced symmetricMemory parameter in algorithm context key generation. Remove disableChannelCache env as is ambiguous 3. Add new args for build_default_algorithms Add flag_buffer, and flag_buffer_size args to build default algorithm. Then we could use unified flag buffer for different algorithms, avoid application hanging when switch algo for different message size. --------- Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com> Co-authored-by: Qinghua Zhou <qinghuazhou@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2026-02-12 19:06:18 -08:00
Binyang Li	a707273701	Torch integration (#692 ) Reorganize current native algorithm implementation and DSL algorithm implementation. Provide unified API for DSL algo and native algo and provide interface to tune the algo Provide interface for pytorch integration with native API and DSL --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2026-01-21 20:32:24 -08:00
Binyang Li	5acac93dbc	Integrate MSCCL++ DSL to torch workload (#620 ) Provides two integration ways for MSCCL++ DSL. 1. Integrate with customized communication group 2. Integrate with NCCL API Introduce new Python APIs to make it work: ```python mscclpp.compile # compile dsl to json based execution plan mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan to executionPlanRegistery mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, the selector will return the best execution plan based on collection, message size, world size.... ``` Fix #556 --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-29 15:39:00 -07:00

4 Commits