mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-11 17:00:22 +00:00

Author	SHA1	Message	Date
Caio Rocha	eb202780f5	Support Synchronous Initialization for Proxy Service (#679 )	2025-11-12 18:35:57 -08:00
Changho Hwang	ffafcaf6d6	IB stack enhancements & bug fixes (#673 ) * Always use `ibv_reg_dmabuf_mr` when DMABUF is supported * Do not check `nvidia-peermem` when unnecessary * More rigorous check on IB port availability * Fixed ibverbs wrappers * Fixed `IbPeerToPeerTest.SimpleAtomicAdd` test	2025-11-07 14:26:17 -08:00
Changho Hwang	960a8ddebd	Add a new logger (#668 ) * Add `logger.hpp` that will gradually replace `debug.h` * Minor fixes in `ib.cc` --------- Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>	2025-11-04 10:32:46 -08:00
Binyang Li	5acac93dbc	Integrate MSCCL++ DSL to torch workload (#620) Provides two ways to integrate the MSCCL++ DSL: 1. Integrate with a customized communication group 2. Integrate with the NCCL API. Introduces new Python APIs to make it work: ```python mscclpp.compile # compile DSL to a JSON-based execution plan mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan with the ExecutionPlanRegistry mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, which returns the best execution plan based on collection, message size, world size.... ``` Fix #556 --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-29 15:39:00 -07:00
Qinghua Zhou	a38c2ee784	FP8 support for Allreduce (#646) Add FP8 support for Allreduce on both NVIDIA and AMD platforms. Add new data types: fp8_e4m3 and fp8_e5m2 --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-10-27 14:51:48 -07:00
Binyang Li	2b05908635	Add token pool for cuCreate API (#628) Create a tokenPool to allocate tokens. This feature is used to support inter-node NVL and tries to reduce the footprint caused by cuCreate --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-10-27 11:19:21 -07:00
Changho Hwang	68b1f151f0	Rename nvls* files (#660 ) Rename nvls* files to switch_channel*	2025-10-24 11:34:26 -07:00
Changho Hwang	200cdf946e	Update `EndpointConfig` interfaces (#651 ) * Separate IB-specific options into a nested struct * Enable `connect()` by an `Endpoint`, not only by `EndpointConfig` * Other minor changes	2025-10-22 10:39:39 -07:00
Binyang Li	3d94383696	Add MSCCLPP_GIT_COMMIT macro (#640) - Add MSCCLPP_GIT_COMMIT macro - Update docs	2025-10-06 15:57:28 -07:00
Binyang Li	70b8297c56	Revise NCCL API implementation (#617) - Make the NCCL interface extensible. Users can register their own algorithms with the NCCL API and provide a customized algorithm selection function. - Fall back to NCCL/RCCL if the algorithm selection function selects no algorithm - MSCCLPP interfaces now work at any scale	2025-09-26 10:08:12 -07:00
Binyang Li	5ac427610d	Address teardown issue (#638) Ignore cuda/cu errors during teardown. Some pointers may be invalid at this point	2025-09-25 12:12:40 -07:00
Changho Hwang	43f160c8e6	Fix for safe process teardown (#633) * `gpuFree()` functions are usually called during process teardown, so we let them ignore errors. `AvoidCudaGraphCaptureGuard` is constructed in `gpuFree*()` functions, so it needs the same fix.	2025-09-10 20:28:04 -07:00
Changho Hwang	571fee16fb	Add `FifoDeviceHandle::poll()` (#630 )	2025-09-09 23:32:35 +00:00
Changho Hwang	c4d8781390	Fix memory exchange within a single process (#624 )	2025-09-04 12:53:51 -07:00
Binyang Li	bb76d27553	all2all implementation (#609 ) Implement single node all2all via MSCCL++ C++API perf kernel 3: ``` size count time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1048576 32768 23.41 44.78 39.19 0 2097152 65536 23.95 87.56 76.61 0 4194304 131072 27.50 152.51 133.45 0 8388608 262144 35.14 238.73 208.89 0 16777216 524288 57.54 291.55 255.11 0 33554432 1048576 109.7 305.81 267.59 0 67108864 2097152 212.3 316.07 276.56 0 134217728 4194304 410.9 326.64 285.81 0 268435456 8388608 784.9 341.99 299.24 0 ``` kernel 2 ``` # in-place out-of-place # size count time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1048576 32768 23.42 44.77 39.17 0 2097152 65536 24.96 84.02 73.52 0 4194304 131072 28.53 147.03 128.65 0 8388608 262144 36.75 228.28 199.75 0 16777216 524288 58.01 289.20 253.05 0 33554432 1048576 110.4 303.83 265.85 0 67108864 2097152 212.4 315.99 276.49 0 134217728 4194304 407.8 329.12 287.98 0 268435456 8388608 797.4 336.64 294.56 0 ``` NCCL: ``` NCCL version 2.21.5+cuda12.4 # # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 8388608 524288 half none -1 38.70 216.75 189.66 0 39.25 213.72 187.00 N/A 16777216 1048576 half none -1 71.39 234.99 205.62 0 68.41 245.25 214.60 N/A 33554432 2097152 half none -1 119.7 280.22 245.20 0 119.8 280.17 245.15 N/A 67108864 4194304 half none -1 211.9 316.66 277.08 0 212.7 315.53 276.09 N/A 134217728 8388608 half none -1 408.4 328.61 287.53 0 393.8 340.87 298.26 N/A 268435456 16777216 half none -1 761.6 352.47 308.41 0 763.3 351.70 307.73 N/A 536870912 33554432 half none -1 1502.5 357.31 312.64 0 1467.3 365.89 320.16 N/A ```	2025-08-14 11:30:40 -07:00
Binyang Li	be6a941fba	New DSL implementation (#579) The PR contains the following changes: Python side: - Channel-based DSL implementation: decouple channels from chunks. - Users create channels explicitly and only need local_rank, remote_rank and channel_type - Adjust the executor JSON file and add remote_buffer fields; different ops can use different combinations of channels and remote buffers. - Reimplement operation fusion and the data dependency check mechanism - Add new ops such as semaphore and pipeline - Clean up code and enhance documentation C++ side: - Support the new execution file JSON format - Support semaphore and pipeline operations - Code cleanup; support the non-zero-copy scenario --------- Co-authored-by: Caio Rocha <caiorocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-09 00:36:20 -07:00
Changho Hwang	542a10f69e	Merge ChannelTrigger with ProxyTrigger (#601 )	2025-08-08 19:07:50 +00:00
Changho Hwang	9650e5c37e	Update documentation (#576 ) Documentation overhaul	2025-08-07 15:37:37 -07:00
Binyang Li	4f6f23dae3	Use smart pointer for IB structure (#585) Change to use smart pointers for IB structures. Registered memory will own ibMr; ibCtx will not hold the reference - Use smart pointers for IbQp and IbMr - Update the memoryChannel API, keep localRegisteredMemory - Close fd when the registered memory is released --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-08-06 10:01:58 -07:00
Changho Hwang	d55ac96f5e	Fixed the local channel test (#597 )	2025-08-05 15:33:48 -07:00
Changho Hwang	334b232e36	Fix GpuStreamPool to be aware of the device ID of streams (#590 ) Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-04 11:07:31 -07:00
Changho Hwang	c580e4c503	Support CudaIpc connection within a single process (#593 ) * Allow CudaIpc connection between GPUs in a single process * Added an example of connection in a single process * Minor interface updates --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-08-02 12:59:36 +08:00
Changho Hwang	aa28b06bf5	Fix relaxedWait() (#594 )	2025-08-01 12:51:30 +08:00
Binyang Li	5e991cf5c8	update readme & bump version (#550 ) Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-07-12 01:00:18 -07:00
Changho Hwang	199468bc47	Revise NVLS interface (#458 ) * Rename `NvlsConnection::DeviceMulticastPointer` to `SwitchChannel` * Minor interface improvements	2025-07-12 00:33:03 +00:00
Changho Hwang	ae56698d67	New semaphore constructors (#559 ) More intuitive interfaces for creating semaphores and channels. Also allows channel construction using third-party bootstrappers directly without overriding MSCCL++ Bootstrap.	2025-07-12 00:10:46 +00:00
Changho Hwang	20eca28942	Fix a FIFO correctness bug (#549 ) * Add a FIFO test code that reproduced a correctness issue * Fix the correctness issue by using pinned memory instead of cudaMemcpy --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2025-07-11 23:53:59 +00:00
Changho Hwang	3de6d5b63a	Fix #557 (#560 ) * Page-locking `Host2DeviceSemaphore::outboundSemaphore_` caused unexpected performance issues so reverting it back. We may revisit this later. * Removed reference to connections from context as now connections refer to context.	2025-06-30 11:33:19 -07:00
Changho Hwang	b4dde38db8	FIFO improvements (#557 ) * Revert `MSCCLPP_FIFO_USE_TAIL_REPLICA=1` back to the default. * Optimize `FifoDeviceHandle`. * Do not use `cudaHostAllocWriteCombined` that increases latency. * Pin host memory for `Host2DeviceSemaphore::outboundSemaphore_`. * Fix proxy NUMA binding issues. * Prevent graph capture inside proxy threads. * Now `CudaIpcConnection` skips stream sync when unnecessary. * Now any type of connection needs to hold a shared pointer to the context for memory safety. * Now a context should be always managed by a shared pointer for memory safety. * Minor docs & interface improvements. * Minor fix in `mscclpp-test` correctness test.	2025-06-24 09:50:28 -07:00
Binyang Li	81699a5bdd	DeviceSemaphore fix (#553) Fix the bug: make sure a thread is woken up when the semaphore is released. This pull request includes a modification to the `DeviceSemaphore` struct in the `concurrency_device.hpp` file, specifically in the `acquire` method. The change refines the logic for acquiring a semaphore by adjusting the condition used to handle contention scenarios.	2025-06-19 12:30:01 -07:00
Changho Hwang	a36dcd56bf	Do not use tail replica by default (#544 ) Added `MSCCLPP_FIFO_USE_TAIL_REPLICA` environment variable to control whether to use a tail replica for the FIFO buffer. Default is false.	2025-06-12 14:07:10 -07:00
Changho Hwang	f694f2e46b	Fix #509 (#546 ) Fix a destruction order issue	2025-06-05 19:36:02 -07:00
Changho Hwang	125d6f5809	Multi-stream CUDA IPC (#326 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com>	2025-06-04 10:31:04 -07:00
Changho Hwang	253a1ba1a9	Use a stream pool for `gpuCalloc()` (#509 ) Previous `gpuCalloc()` creates a new stream for each allocation, which messes the timeline up in profiler traces. Now `GpuStreamPool` allows reusing the temporal streams.	2025-06-04 10:07:20 -07:00
Changho Hwang	83356957bd	Improved documentation & minor interface revision (#541 )	2025-06-03 14:26:27 -07:00
Changho Hwang	c184485808	DLPack fixes (#537 ) * Fix typos in type name * Make it work without current context set	2025-05-27 21:40:50 +00:00
Changho Hwang	7278b51e61	Rename `ChannelTrigger` fields and check field values in debug builds (#529 )	2025-05-27 14:36:22 -07:00
Changho Hwang	2b9b18d562	Address NVCC warning #20012-D (#528 )	2025-05-21 10:37:50 -07:00
Binyang Li	d1869011c2	Add device semaphore API (#523 ) Add deviceSemaphore structure, implement a new NVLS based algo to show how to use these APIs. Current perf for NVLS non-zero copy version is: ``` # # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1024 512 half sum -1 6.10 0.17 0.29 0 5.65 0.18 0.32 0 2048 1024 half sum -1 5.94 0.35 0.60 0 5.85 0.35 0.61 0 4096 2048 half sum -1 6.11 0.67 1.17 0 5.97 0.69 1.20 0 8192 4096 half sum -1 6.22 1.32 2.31 0 6.17 1.33 2.33 0 16384 8192 half sum -1 6.68 2.45 4.29 0 6.52 2.51 4.39 0 32768 16384 half sum -1 8.02 4.09 7.15 0 7.66 4.28 7.49 0 65536 32768 half sum -1 8.09 8.10 14.18 0 7.91 8.29 14.51 0 131072 65536 half sum -1 9.58 13.68 23.93 0 9.61 13.64 23.86 0 262144 131072 half sum -1 12.60 20.81 36.42 0 12.28 21.35 37.37 0 524288 262144 half sum -1 14.51 36.12 63.22 0 14.09 37.21 65.12 0 1048576 524288 half sum -1 19.45 53.92 94.36 0 19.29 54.35 95.12 0 2097152 1048576 half sum -1 31.00 67.66 118.40 0 30.80 68.08 119.14 0 4194304 2097152 half sum -1 44.71 93.80 164.16 0 44.66 93.91 164.34 0 8388608 4194304 half sum -1 62.96 133.24 233.17 0 62.49 134.24 234.91 0 16777216 8388608 half sum -1 105.1 159.68 279.45 0 104.4 160.74 281.29 0 33554432 16777216 half sum -1 169.9 197.55 345.71 0 169.8 197.64 345.87 0 67108864 33554432 half sum -1 298.1 225.12 393.96 0 298.1 225.09 393.91 0 134217728 67108864 half sum -1 552.9 242.77 424.84 0 553.7 242.39 424.18 0 268435456 134217728 half sum -1 1055.8 254.24 444.91 0 1056.9 253.98 444.47 0 536870912 268435456 half sum -1 2040.1 263.15 460.52 0 2045.1 262.52 459.40 0 1073741824 536870912 half sum -1 3996.9 268.65 470.13 0 4007.7 267.92 468.86 0 ```	2025-05-20 09:32:38 -07:00
Changho Hwang	5205618c4a	Fix device assert (#522 ) * Fixed a bug that external `assert()`s may not be compiled with mscclpp headers * Use a macro assert instead of a function	2025-05-12 13:38:11 -07:00
Binyang Li	a464b9f21e	Adding maxSpinCount to port channel flush (#518 ) Fix #482 --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-05-08 21:24:48 -07:00
Changho Hwang	d636093336	Asynchronous setup (#514) Cherry-picked a part of features from #167: now `Communicator::setup()` is unneeded. `Communicator::sendMemory()` conducts the task inline, and `Communicator::recvMemory()` and `Communicator::connect()` conduct the task asynchronously without explicit setup.	2025-05-08 22:01:51 +00:00
Binyang Li	affca7d9bc	Add NVLS based fallback algo (#507) Add two NVLS-based fallback algorithms. allreduce9 is for NVLS with zero copy. allreduce10 is for the case where NVLS needs to copy to a scratch buffer, perform the reduce operation, then copy the result back to the result buffer. Perf number for allreduce9 ``` # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1024 256 float sum -1 5.45 0.19 0.33 0 5.35 0.19 0.33 0 2048 512 float sum -1 5.57 0.37 0.64 0 5.53 0.37 0.65 0 4096 1024 float sum -1 5.80 0.71 1.24 0 5.78 0.71 1.24 0 8192 2048 float sum -1 5.94 1.38 2.42 0 5.85 1.40 2.45 0 16384 4096 float sum -1 6.40 2.56 4.48 0 6.27 2.61 4.57 0 32768 8192 float sum -1 7.45 4.40 7.70 0 7.39 4.43 7.76 0 65536 16384 float sum -1 8.03 8.17 14.29 0 8.32 7.88 13.79 0 131072 32768 float sum -1 7.28 18.00 31.49 0 7.07 18.53 32.43 0 262144 65536 float sum -1 7.72 33.95 59.41 0 7.59 34.56 60.48 0 524288 131072 float sum -1 8.70 60.29 105.51 0 8.37 62.61 109.57 0 1048576 262144 float sum -1 10.56 99.26 173.70 0 10.32 101.64 177.87 0 2097152 524288 float sum -1 14.45 145.14 253.99 0 14.02 149.58 261.76 0 4194304 1048576 float sum -1 22.83 183.73 321.52 0 23.03 182.14 318.75 0 8388608 2097152 float sum -1 38.63 217.14 380.00 0 38.57 217.52 380.65 0 16777216 4194304 float sum -1 70.03 239.58 419.27 0 69.96 239.80 419.66 0 33554432 8388608 float sum -1 131.5 255.17 446.55 0 131.3 255.59 447.28 0 67108864 16777216 float sum -1 255.8 262.37 459.15 0 255.4 262.75 459.82 0 134217728 33554432 float sum -1 500.9 267.94 468.90 0 500.0 268.42 469.74 0 268435456 67108864 float sum -1 989.0 271.41 474.97 0 988.9 271.45 475.05 0 536870912 134217728 float sum -1 1967.4 272.88 477.54 0 1966.0 273.08 477.88 0 1073741824 268435456 float sum -1 3908.5 274.72 480.77 0 3904.6 274.99 481.24 0 # Out of bounds values : 0 OK # Avg bus bandwidth : 218.734 ``` Perf number for allreduce10 ``` # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) 1024 256 float sum -1 5.60 0.18 0.32 0 5.52 0.19 0.32 0 2048 512 float sum -1 5.79 0.35 0.62 0 5.64 0.36 0.64 0 4096 1024 float sum -1 5.92 0.69 1.21 0 5.82 0.70 1.23 0 8192 2048 float sum -1 6.03 1.36 2.38 0 5.95 1.38 2.41 0 16384 4096 float sum -1 6.58 2.49 4.35 0 6.39 2.56 4.49 0 32768 8192 float sum -1 7.54 4.34 7.60 0 7.41 4.42 7.74 0 65536 16384 float sum -1 7.95 8.24 14.42 0 8.10 8.09 14.16 0 131072 32768 float sum -1 9.56 13.72 24.00 0 9.47 13.84 24.23 0 262144 65536 float sum -1 11.49 22.81 39.92 0 11.41 22.97 40.20 0 524288 131072 float sum -1 14.19 36.94 64.64 0 13.88 37.76 66.09 0 1048576 262144 float sum -1 19.10 54.89 96.06 0 18.98 55.24 96.67 0 2097152 524288 float sum -1 31.12 67.38 117.91 0 31.34 66.92 117.10 0 4194304 1048576 float sum -1 44.88 93.46 163.56 0 44.76 93.70 163.97 0 8388608 2097152 float sum -1 63.23 132.68 232.18 0 62.53 134.14 234.75 0 16777216 4194304 float sum -1 106.8 157.03 274.80 0 105.9 158.46 277.30 0 33554432 8388608 float sum -1 172.2 194.91 341.09 0 172.0 195.05 341.35 0 67108864 16777216 float sum -1 299.8 223.83 391.70 0 300.8 223.12 390.46 0 134217728 33554432 float sum -1 553.1 242.66 424.66 0 553.8 242.38 424.16 0 268435456 67108864 float sum -1 1056.1 254.18 444.82 0 1057.4 253.86 444.26 0 536870912 134217728 float sum -1 2064.0 260.11 455.20 0 2063.8 260.14 455.25 0 1073741824 268435456 float sum -1 4074.4 263.53 461.18 0 4065.8 264.09 462.16 0 # Out of bounds values : 0 OK # Avg bus bandwidth : 169.799 ``` --------- Co-authored-by: Sreevatsa Anantharamu <sreevatsanadig@gmail.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2025-04-27 14:09:31 -07:00
Changho Hwang	b310783603	Fix #508 (#515 ) * Wrong offsets in `unpackPackets()` * Added Python binding of `BaseMemoryChannel`	2025-04-25 09:52:05 -07:00
Changho Hwang	710f6686dc	Revised MemoryChannel interfaces (#508 ) * Moved the `MemoryChannel::copy()` method out of the `MemoryChannel` as a standalone function. * Renamed `mscclpp::putPackets()` and `mscclpp::getPackets()` to `mscclpp::copyToPackets()` and `mscclpp::copyFromPackets()` respectively for consistency. * Renamed `MemoryChannel::getPackets()` to `MemoryChannel::unpackPackets()` for clarity. Renamed `getPacketBuffer` to `packetBuffer`. * Added the `MemoryChannel::unpackPacket()` method that unpacks one packet in the buffer. * Added the `BaseMemoryChannel` class that only contains a semaphore without memory addresses. * Removed the `MemoryDevice2DeviceSemaphoreDeviceHandle::signalPacket()` method that is lacking use cases.	2025-04-25 00:02:56 +00:00
Changho Hwang	cbdcf9064c	Use implicit ctors for default device ctors (#512 ) By using implicit constructors, the compiler doesn't need to dynamically initialize the instances.	2025-04-24 12:38:19 -07:00
Binyang Li	7da11b35d5	Add flag to disable nvls (#500) Mitigates issue #496: for now, `ibv_reg_dmabuf_mr` is not supported on Azure VMs. Add this flag to force the use of cudaMalloc for memory allocation and disable the NVLS feature	2025-04-22 17:09:19 -07:00
Binyang Li	06f31994dc	Fix performance issue introduced in PR: 499 (#505) 1. Use `fence+relaxed` to replace `release` for the FIFO. `fence+relaxed` is more efficient on A100 2. Update the deviceSyncer. The previous one cannot handle a change in the number of thread blocks correctly. Use three counters to solve this issue: reset the previous counter before syncing on the current counter. 3. Introduce relaxedWait, which can be used with relaxedSignal for cases that don't need to guarantee memory visibility	2025-04-22 14:03:37 -07:00
Binyang Li	e412804eab	Improve signal/wait performance and fix barrier issue (#499) Remove __assert_fail for release builds. This will reduce the number of PTX instructions inside the loop. Also tries to resolve the issue reported in #497. Reduces the number of PTX instructions from 8 to 6. 8-rank signal/wait latency drops from 3.2us to 2.8us on NDv5. Also, the NDEBUG flag is confusing here; sometimes it is not set. Use a customized flag for debug builds. Here is the current PTX: ``` ld.u64 %rd12, [%rd2+-24]; mov.u64 %rd13, %rd12; mov.u64 %rd11, %rd13; ld.acquire.sys.b64 %rd10,[%rd11]; setp.lt.u64 %p1, %rd10, %rd3; @%p1 bra $L__BB2_1; ``` If we change to `asm volatile("ld.global.acquire.sys.b64 %0, [%1];" : "=l"(flag) : "l"(flag_addr));`, it will reduce to 4 instructions. We can get 2.1 us for 8-rank signal/wait ``` ld.u64 %rd9, [%rd1+-24]; ld.global.acquire.sys.b64 %rd8, [%rd9]; setp.lt.u64 %p1, %rd8, %rd2; @%p1 bra $L__BB2_1; ```	2025-04-16 14:22:10 -07:00
Binyang Li	adc9ee5684	Export mscclpp GpuBuffer to dlpack format (#492) For mscclpp, using NVLS requires that the buffer be allocated by mscclpp::GpuBuffer. Because CuPy doesn't support bfloat16 yet, we export the raw buffer to DLPack format. Users can use this feature to create buffers with types supported by PyTorch ```python buffer = RawGpuBuffer(1024 * 2) # 2 for bfloat16 dl_pack = buffer.to_dlpack(str(torch.bfloat16)) tensor = torch.utils.dlpack.from_dlpack(dl_pack) ```	2025-04-03 12:59:32 -07:00

152 Commits