mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-06-29 02:47:23 +00:00

Author	SHA1	Message	Date
Empyreus	93f96e97cd	Merge branch 'main' into rjsouza/nvls-allgather-pr	2026-06-23 23:27:24 +00:00
Caio Rocha	88e1a44858	wip	2026-06-23 22:41:27 +00:00
Caio Rocha	2091a337be	wip	2026-06-23 09:04:05 +00:00
Caio Rocha	f74814142e	wip	2026-06-23 08:32:14 +00:00
Binyang Li	dc0b8d75f3	GB200 support: SendRecv DSL collective and per-channel executor connections (#810 ) ## Summary GB200 support work: introduces point-to-point send/receive in the MSCCL++ DSL and extends the executor for split-NVL-domain topologies where some ranks are NVL-connected within a node and other ranks must communicate across the network. ### DSL - New `SendRecv` collective with separate input/output buffers (`python/mscclpp/language/collectives.py`). - New multi-node sendrecv DSL example (`python/mscclpp/language/tests/multi_node/send_recv.py`) with `--split_mask` (group size − 1) and `--instances` CLI options. Documents the channel-ordering trick that keeps signal tags cross-matched between paired peers when `prev == next`. - `BaseBuffer.__getitem__` now accepts slices with `None` start/stop (e.g., `buf[:]`). ### Executor - One connection (unique QP) per channel entry instead of one per peer. Required for HostNoAtomic IB mode where each QP can forward signals to a single semaphore. Uses per-peer tag counters so paired ranks agree on tag ordering regardless of the order peers appear in each rank's `connected_to` list. - MEMORY channels now unconditionally use `Transport::CudaIpc`; only PORT channels can use IB. Matches the invariant already enforced by `getTransportFlags`. - `ExecutionContext::connections` is now a `vector<Connection>` indexed by channel order (was `unordered_map<int, Connection>` keyed by peer). Removes redundant semaphore fields from `ExecutionContext`. - TODO: explicit NVL-domain check in `useIB` --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2026-06-19 13:19:01 -07:00
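The per-peer tag counters mentioned in dc0b8d75f3 can be illustrated with a small sketch. This is not the executor's code: the function names and the example `connected_to` lists below are hypothetical, and the sketch only assumes that a tag is an integer assigned to each channel entry in list order. It compares one counter shared across all peers with one counter per peer, and shows that only the per-peer scheme gives both ends of a pair the same tag sequence when the peers are listed in different orders.

```python
from collections import defaultdict


def assign_tags_global(connected_to):
    """Naive scheme (hypothetical): a single counter shared by all peers."""
    return [(peer, tag) for tag, peer in enumerate(connected_to)]


def assign_tags_per_peer(connected_to):
    """Per-peer scheme (hypothetical): the n-th channel to a peer gets tag n."""
    next_tag = defaultdict(int)
    tagged = []
    for peer in connected_to:
        tagged.append((peer, next_tag[peer]))
        next_tag[peer] += 1
    return tagged


def tags_for(tagged, peer):
    """Collect the tags a rank assigned to its channels with one peer."""
    return [tag for p, tag in tagged if p == peer]


# Assumed example: rank 0 lists peer 1, peer 2, peer 1; rank 1 lists peer 0 twice.
rank0_connected_to = [1, 2, 1]
rank1_connected_to = [0, 0]

# Shared counter: the two ends of the (0, 1) pair disagree.
print(tags_for(assign_tags_global(rank0_connected_to), 1),
      tags_for(assign_tags_global(rank1_connected_to), 0))    # [0, 2] [0, 1]

# Per-peer counters: both ends see the same sequence.
print(tags_for(assign_tags_per_peer(rank0_connected_to), 1),
      tags_for(assign_tags_per_peer(rank1_connected_to), 0))  # [0, 1] [0, 1]
```

With per-peer counters, the tag depends only on how many channels a rank has already opened to that specific peer, so interleaving other peers in the list does not shift it.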
Empyreus	e342039f88	add comments	2026-06-18 17:50:18 +00:00
Empyreus	7813f3b1b0	remove execution_kernel alignment check	2026-06-18 17:00:49 +00:00
Empyreus	9a02f3669f	enforce scratch on broadcast packets	2026-06-18 16:17:56 +00:00
Empyreus	253bc05c7c	add auto_sync false flag	2026-06-18 00:13:36 +00:00
Empyreus	9b4175412b	combine groupstore and groupstorepacket	2026-06-18 00:07:38 +00:00
Empyreus	24187fcded	use .to_dict()	2026-06-18 00:01:41 +00:00
Empyreus	e48c6da34b	fix formatting	2026-06-17 00:11:06 +00:00
Empyreus	1d95244cf9	tune for ptk	2026-06-16 19:46:14 +00:00
Empyreus	1e17f20618	add allgather packet algorithm	2026-06-11 20:46:51 +00:00
Empyreus	02eb2cfc2e	add support for allgather packet for small message sizes	2026-06-11 20:46:09 +00:00
Empyreus	5d7737437a	handle non 16bit aligned	2026-06-09 18:54:38 +00:00
Empyreus	6f61c014c1	fix missing flag	2026-06-09 18:07:25 +00:00
Empyreus	3b263c3324	revert useIB change	2026-06-09 16:02:33 +00:00
Empyreus	5348c4a774	refactor function for thread usage	2026-06-08 21:53:46 +00:00
Empyreus	1e1187aa65	reuse useIB function	2026-06-08 21:53:16 +00:00
Empyreus	f2204ee569	improve variable names	2026-06-08 21:02:39 +00:00
Empyreus	b864746083	python formatting fixes	2026-06-08 20:28:34 +00:00
Empyreus	2b87985927	update to type agnostic	2026-06-08 20:04:33 +00:00
Empyreus	0d8efdb43d	update algo	2026-06-08 20:04:33 +00:00
Empyreus	00668b4a41	add allgather gstore support	2026-06-08 20:04:33 +00:00
Empyreus	54bfb1d3b7	add flag to disable IB	2026-06-08 20:04:33 +00:00
Binyang Li	7c390fffd6	Expose NVLS multicast granularity option for GpuBuffer (#815 ) Add a public Granularity enum (MultiCastMinimum, MultiCastRecommended) and let GpuBuffer choose the NVLS multicast allocation granularity via a constructor argument, defaulting to MultiCastMinimum to minimize memory usage. Expose the same option through the C++ and Python (nanobind) APIs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-06-04 13:16:18 -07:00
Binyang Li	c9f8be64bb	Add collective benchmark and correctness check (#814) - Add unit-test for float8_e4m3b15 data type. - Add tuner and benchmark for allreduce/allgather algo to check correctness and performance.	2026-06-04 09:22:10 -07:00
RJ Souza	29d5beb348	Adding Support for SGLang CI Tests (#800 ) Adds Azure DevOps pipelines, templates, and supporting scripts to run SGLang end-to-end and benchmark tests against MSCCL++ on H100 GPU nodes, plus the Docker image and small infrastructure tweaks needed to make those pipelines runnable. --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-05-27 14:57:21 -07:00
Binyang Li	379d0e51e4	Fix for allgather_fullmesh algo (#813 )	2026-05-26 12:34:13 -07:00
Binyang Li	08ee18be64	Add check to filter invalid nblock/nthread candidates (#811 ) Add check for invalid nblock/nthread candidate	2026-05-22 09:18:41 -07:00
Binyang Li	9e177b388c	remove useless sync (#809 )	2026-05-20 16:49:49 -07:00
Binyang Li	72621e7221	add nBlocks check for allreduce_allpair_packet algo (#807) - Fix a correctness issue in the allreduce_allpair_packet algo: make sure the input buffer is not overwritten, and use the same tb for send/reduce/write-back. - Check that nBlocks/nthreads are valid for the packet algorithm. - Add more logs. - Modify the flag update logic so that it works when nthreadPerNBlock < nflags. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-20 09:29:55 -07:00
Caio Rocha	c1071318c8	Include a static synchronization check in the DSL. (#806 )	2026-05-19 13:06:53 -07:00
Binyang Li	60a6d7219f	Clean up completed communicator receives (#804 ) ## Summary - Release the reference after last requests are ready. - Keep ordered receive chaining for repeated rank/tag operations while cleaning up completed receive bookkeeping. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-15 21:06:50 +00:00
Changho Hwang	252a422030	Handle PortChannel flush asynchronously from the host proxy (#802 ) When a `PortChannel` requests `flush()`, the host-side proxy was being blocked, which may cause head-of-line blocking of other parallel `PortChannel`s' requests. Now the proxy handles `flush()` requests asynchronously. This feature especially helps performance when we need multiple IB QPs and need to flush QPs.	2026-05-15 11:50:43 -07:00
Changho Hwang	5d608feaa5	Enhance cross-node CudaIpc availability check (#803 )	2026-05-14 14:06:12 -07:00
Caio Rocha	40295df4c4	Adding Support to bf16 Executor Tests (#801) This pull request adds support for the `bfloat16` (bf16) data type to the test executor, including both Python and CUDA components. The changes ensure that `bfloat16` is handled consistently across argument parsing, data type conversion, and test kernel implementations. Additionally, the CUDA verification kernels are refactored to use parameterized tolerances for improved numerical accuracy checks. Support for bfloat16 data type: * Added handling for `bfloat16`/`bf16` in the Python test executor's argument parsing, data type conversion (`parse_dtype`, `dtype_to_mscclpp_dtype`), and help text. * Updated output to display the correct data type string for `bfloat16`. CUDA kernel and test improvements: * Included `bfloat16` headers and defined test data fill and gather kernels for `bfloat16` on both CUDA and HIP platforms. * Refactored verification kernels (`ALL_REDUCE`, `REDUCE_SCATTER`) to use an explicit tolerance parameter (`Eps`) and added correct tolerances for each data type, including `bfloat16`. These changes ensure full support for `bfloat16` in the test executor and improve the accuracy and maintainability of the CUDA test kernels. --------- Co-authored-by: Caio Rocha <caiorocha@microsof.com>	2026-05-14 09:56:11 -07:00
Caio Rocha	0c9b9abfd5	Adding Support 4 Nodes AllReduce Small Message Size (#794)

Results on 4 Nodes H200:

| Size | NCCL   | MSCCL++ 57TB | MSCCL++ 29TB |
|------|--------|--------------|--------------|
| 8K   | 45.75  | 17.74        | 18.18        |
| 16K  | 47.08  | 18.9         | 18.42        |
| 32K  | 47.29  | 19.48        | 19.12        |
| 64K  | 50.34  | 20.51        | 19.29        |
| 128K | 59.65  | 21.37        | 20.25        |
| 256K | 87.46  | 23.87        | 23.51        |
| 512K | 106.55 | 29.15        | 29.51        |
| 1M   | 115    | 40.64        | 41.83        |
| 2M   | 135.89 | 63.73        | 70.45        |
| 4M   | 177.59 | 121.76       | 128.79       |
| 8M   | 251.17 | 228.5        | 251.36       |

--------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsof.com> sglang-v0.9.1	2026-05-12 13:45:55 -07:00
Mahdieh Ghazi	822fbb2351	Adding necessary macros for enabling mrc support (#797 ) This PR adds necessary macros and instructions for enabling mrc support with no atomic.	2026-05-05 17:17:41 -04:00
Binyang Li	9ec26fa4d1	Reset GPU tokens before reuse (#795 ) Fixes a token-reuse bug in `TokenPool` that's independent of MNNVL. ## Bug `TokenPool` hands out 8-byte device-memory slots used as device-semaphore counters. The deleter only cleared the bitmap — the underlying GPU memory was left as-is. When a token was freed and later re-allocated, the new semaphore inherited the previous counter value instead of starting at 0, breaking subsequent `signal()/wait()` math. ## Fix * Add a synchronous `gpuMemset` host helper (mirrors `gpuMemcpy` / `gpuMemcpyAsync`). * Zero the slot inside the `TokenPool` deleter so recycled tokens hand out a clean counter. The very-first allocation is already zeroed by `gpuCallocPhysical` (`src/core/gpu_utils.cc:227-228`), so first-time tokens are also clean — the deleter only has to handle the recycle case. ## Notes * Public wrapper is named `mscclpp::gpuMemset` (not `mscclpp::memset`) for symmetry with `gpuMemcpy` and to avoid shadowing `std::memset` in TUs that pull the namespace in. * Zeroing happens on release rather than acquire so the cost is paid in the typically less perf-sensitive teardown path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-04 15:11:47 -07:00
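The token-reuse bug described in 9ec26fa4d1 can be modeled on the host with a toy pool. This is only a sketch under stated assumptions: `ToyTokenPool` and its methods are hypothetical names, a `bytearray` stands in for device memory, and a Python list stands in for the allocation bitmap. It is not the `TokenPool` implementation. It shows why clearing only the bitmap lets a recycled slot inherit the old counter, and how zeroing the slot on release avoids that.

```python
class ToyTokenPool:
    """Toy host-side model of a pool of 8-byte counter slots (hypothetical)."""

    SLOT_BYTES = 8

    def __init__(self, num_slots: int) -> None:
        # Backing store starts zeroed, like a freshly calloc'ed allocation.
        self.memory = bytearray(num_slots * self.SLOT_BYTES)
        # One flag per slot; stands in for the allocation bitmap.
        self.in_use = [False] * num_slots

    def _span(self, slot: int) -> slice:
        start = slot * self.SLOT_BYTES
        return slice(start, start + self.SLOT_BYTES)

    def acquire(self) -> int:
        """Hand out the lowest free slot."""
        for slot, used in enumerate(self.in_use):
            if not used:
                self.in_use[slot] = True
                return slot
        raise RuntimeError("pool exhausted")

    def read(self, slot: int) -> int:
        return int.from_bytes(self.memory[self._span(slot)], "little")

    def signal(self, slot: int) -> None:
        """Increment the counter stored in the slot."""
        value = self.read(slot) + 1
        self.memory[self._span(slot)] = value.to_bytes(self.SLOT_BYTES, "little")

    def release(self, slot: int, zero_on_release: bool = True) -> None:
        """Free a slot; optionally zero it so the next owner starts at 0."""
        if zero_on_release:
            self.memory[self._span(slot)] = bytes(self.SLOT_BYTES)
        self.in_use[slot] = False


pool = ToyTokenPool(num_slots=2)

# Buggy behavior: only the bitmap is cleared, so the counter survives reuse.
first = pool.acquire()
for _ in range(3):
    pool.signal(first)
pool.release(first, zero_on_release=False)
reused = pool.acquire()
stale_value = pool.read(reused)   # 3, inherited from the previous owner

# Fixed behavior: the slot is zeroed when it is released.
pool.release(reused)
clean = pool.acquire()
clean_value = pool.read(clean)    # 0
```

Zeroing in `release` rather than `acquire` mirrors the commit's note that the cost is paid on the teardown path.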
Binyang Li	2c52937b26	Fix FP8 ROCm build/test issues and dtype naming (#792 ) ## Summary - Fix ROCm FP8 build failure by using the actual FP8 `DataType` enum constants in allreduce packet tuning. - Fix FP8 E4M3FNUZ test encoding so small negative values do not produce the FNUZ NaN byte (`0x80`). - Align FP8 `DataType` enum constants and Python bindings with torch-style names (`FLOAT8_E4M3FN`, `FLOAT8_E4M3FNUZ`, `FLOAT8_E5M2FNUZ` / `float8_e4m3fn`, `float8_e4m3fnuz`, `float8_e5m2fnuz`). ## Validation - `./tools/lint.sh` - `make -j` from `build/` - `mpirun --allow-run-as-root -np 8 python3 -m pytest python/test/test_fp8_accum.py -q` (`36 passed, 9 skipped`) - `DTYPE=float8_e4m3fnuz ACCUM_DTYPE=float32 torchrun --nnodes=1 --nproc_per_node=8 examples/torch-integration/customized_comm_with_tuning.py` --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-28 15:02:22 -07:00
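The `0x80` issue noted in 2c52937b26 comes from the FNUZ encodings having no negative zero: the bit pattern with only the sign bit set is the NaN encoding. The sketch below shows one way an encoder can avoid producing that byte when a small negative value rounds to zero magnitude. `pack_e4m3fnuz` is a hypothetical helper, not the repository's test code, and it assumes the 7-bit exponent/mantissa magnitude has already been computed by an earlier rounding step.

```python
E4M3FNUZ_NAN = 0x80  # sign bit set, exponent and mantissa all zero


def pack_e4m3fnuz(negative: bool, magnitude_bits: int) -> int:
    """Combine a sign and a 7-bit magnitude into one E4M3FNUZ byte.

    Hypothetical helper: magnitude_bits holds the 4 exponent bits and
    3 mantissa bits produced by an earlier rounding step.
    """
    if not 0 <= magnitude_bits <= 0x7F:
        raise ValueError("magnitude must fit in 7 bits")
    if magnitude_bits == 0:
        # A negative value that rounded to zero must become +0 (0x00);
        # keeping the sign bit would yield the NaN byte 0x80.
        return 0x00
    return (0x80 if negative else 0x00) | magnitude_bits


# A tiny negative input that rounded to zero magnitude encodes as +0.
encoded_zero = pack_e4m3fnuz(negative=True, magnitude_bits=0)
# Any nonzero magnitude keeps its sign bit.
encoded_neg = pack_e4m3fnuz(negative=True, magnitude_bits=0x01)
```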
Changho Hwang	c97be492d5	GDRCopy status message to string (#793 )	2026-04-27 10:32:20 -07:00
Copilot	e874bf1666	fix: isCuMemMapAllocated crashes on non-NVLS systems even with MSCCLPP_FORCE_DISABLE_NVLS=true (#790 ) - [x] Fix `isCuMemMapAllocated()` to just return `true/false` without throwing when NVLS is not supported - [x] Fix `isNvlsSupported()` caching bug where `result`/`isChecked` were never updated - [x] Restore `[[maybe_unused]]` on `result` and `isChecked` statics — needed in HIP/ROCm env where `CUDA_NVLS_API_AVAILABLE` is not defined and the variables would otherwise be unused - [x] Run linter (`./tools/lint.sh`) --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>	2026-04-22 10:12:40 -07:00
Binyang Li	eeea00b298	Support python wheel build (#787 ) ## Support Python wheel build This PR modernizes the Python packaging for MSCCL++ by defining dependencies and optional extras in `pyproject.toml`, enabling proper wheel builds with `pip install ".[cuda12]"`. ### Changes `pyproject.toml` - Add `dependencies` (numpy, blake3, pybind11, sortedcontainers) - Add `optional-dependencies` for platform-specific CuPy (`cuda11`, `cuda12`, `cuda13`, `rocm6`), `benchmark`, and `test` extras - Bump minimum Python version from 3.8 to 3.10 `test/deploy/setup.sh` - Use `pip install ".[<platform>,benchmark,test]"` instead of separate `pip install -r requirements_.txt` + `pip install .` steps - Add missing CUDA 13 case `docs/quickstart.md`* - Update install instructions to use extras (e.g., `pip install ".[cuda12]"`) - Document all available extras and clarify that `rocm6` builds CuPy from source - Update Python version references to 3.10 `python/csrc/CMakeLists.txt`, `python/test/CMakeLists.txt` - Update `find_package(Python)` from 3.8 to 3.10 ### Notes - The `requirements_*.txt` files are kept for Docker base image builds where only dependencies (not the project itself) should be installed. - CuPy is intentionally not in base dependencies — users must specify a platform extra to get the correct pre-built wheel (or source build for ROCm). --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-16 21:24:45 -07:00
Binyang Li	572028ea3d	Fix nccl-test CI building for all GPU architectures (#786 ) ## Problem `nccl-test.yml` was the only CI template calling `deploy.yml` without passing `gpuArch`. Since the CI build machine has no GPU, CMake fell back to building for all supported architectures (`80;90;100;120`), unnecessarily slowing down CI builds. ## Fix - Add `gpuArch` parameter to `nccl-test.yml` and forward it to `deploy.yml` - Pass `gpuArch: '80'` (A100) and `gpuArch: '90'` (H100) from `nccl-api-test.yml` All other templates were already passing `gpuArch` correctly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-15 12:55:40 -07:00
Binyang Li	ecd33722d4	Fix multi-node H100 CI: CUDA compat, deploy improvements (#781 ) ## Summary - Multi-node H100 CI setup: Improve architecture detection and GPU configuration - Remove hardcoded VMSS hostnames from deploy files - Fix CUDA compat library issue: Remove stale compat paths from Docker image for CUDA 12+. Instead, `peer_access_test` now returns a distinct exit code (2) for CUDA init failure, and `setup.sh` conditionally adds compat libs only when needed. This fixes `cudaErrorSystemNotReady` (error 803) when the host driver is newer than the container's compat libs. - Speed up deploy: Replace recursive `parallel-scp` with tar+scp+untar to avoid per-file SSH overhead. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-13 21:51:29 -07:00
Caio Rocha	b6d0ca13ca	Adding CI Test to DSL Executor (#782 )	2026-04-13 13:55:45 -07:00
Caio Rocha	b59e6d7f00	Updating NpKit (#785 )	2026-04-13 13:36:42 -07:00
Binyang Li	5380a4ac6e	Add MSCCLPP_IB_GID_INDEX env (#780 ) Use MSCCLPP_IB_GID_INDEX to control ib gid index --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-13 09:59:42 -07:00

976 Commits