Commit Graph

1029 Commits

Author SHA1 Message Date
empyreus
812d43d406 return failed result for new test 2026-05-04 20:40:12 +00:00
empyreus
f6637cc458 attempt to print nvidia-smi for cuda drivers
Co-authored-by: Copilot <copilot@github.com>
2026-05-04 20:11:21 +00:00
empyreus
21197f7c0a change directory 2026-05-04 20:02:34 +00:00
empyreus
dfdc9f701e update pool 2026-05-04 18:29:54 +00:00
empyreus
eaa611f220 split multi node test 2026-05-04 18:10:42 +00:00
empyreus
de244e528b update sglang bench 2026-05-04 18:04:30 +00:00
empyreus
97a4b1aa69 remove duplicate stop 2026-05-04 17:23:01 +00:00
empyreus
cb430b35d4 clean up deploy 2026-05-04 17:21:56 +00:00
empyreus
a8b959946a Initial new test 2026-05-04 17:18:02 +00:00
empyreus
e091f65143 Merge branch 'main' into rjsouza/sglang-tests 2026-05-04 17:06:18 +00:00
Changho Hwang
c97be492d5 GDRCopy status message to string (#793) 2026-04-27 10:32:20 -07:00
Copilot
e874bf1666 fix: isCuMemMapAllocated crashes on non-NVLS systems even with MSCCLPP_FORCE_DISABLE_NVLS=true (#790)
- [x] Fix `isCuMemMapAllocated()` to just return `true/false` without
throwing when NVLS is not supported
- [x] Fix `isNvlsSupported()` caching bug where `result`/`isChecked`
were never updated
- [x] Restore `[[maybe_unused]]` on `result` and `isChecked` statics —
needed in HIP/ROCm env where `CUDA_NVLS_API_AVAILABLE` is not defined
and the variables would otherwise be unused
- [x] Run linter (`./tools/lint.sh`)

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
2026-04-22 10:12:40 -07:00
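For context on #790 above, here is a minimal sketch of the cached-capability-check pattern being fixed: the function-local statics must actually be assigned, otherwise the cache never takes effect, and on builds without NVLS the check should report `false` rather than throw. All names below are illustrative, not the actual mscclpp source.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical probe standing in for the real NVLS capability check.
[[maybe_unused]] static bool probeNvlsSupport() {
  const char* disable = std::getenv("MSCCLPP_FORCE_DISABLE_NVLS");
  return !(disable != nullptr && std::strcmp(disable, "true") == 0);
}

// Sketch of the cached-check pattern from #790 (illustrative only).
// Bug class: the statics were never updated, so the cache never took effect.
// [[maybe_unused]] keeps warning-as-error builds happy when the guarded branch
// is compiled out (e.g. HIP/ROCm, where CUDA_NVLS_API_AVAILABLE is not defined).
static bool isNvlsSupportedSketch() {
  [[maybe_unused]] static bool result = false;
  [[maybe_unused]] static bool isChecked = false;
#if defined(CUDA_NVLS_API_AVAILABLE)
  if (!isChecked) {
    result = probeNvlsSupport();  // run the probe once, remember the outcome
    isChecked = true;
  }
  return result;
#else
  return false;  // NVLS never available on this build; report false instead of throwing
#endif
}
```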
Binyang Li
eeea00b298 Support python wheel build (#787)
## Support Python wheel build

This PR modernizes the Python packaging for MSCCL++ by defining
dependencies and optional extras in `pyproject.toml`, enabling proper
wheel builds with `pip install ".[cuda12]"`.

### Changes

**`pyproject.toml`**
- Add `dependencies` (numpy, blake3, pybind11, sortedcontainers)
- Add `optional-dependencies` for platform-specific CuPy (`cuda11`,
`cuda12`, `cuda13`, `rocm6`), `benchmark`, and `test` extras
- Bump minimum Python version from 3.8 to 3.10

**`test/deploy/setup.sh`**
- Use `pip install ".[<platform>,benchmark,test]"` instead of separate
`pip install -r requirements_*.txt` + `pip install .` steps
- Add missing CUDA 13 case

**`docs/quickstart.md`**
- Update install instructions to use extras (e.g., `pip install
".[cuda12]"`)
- Document all available extras and clarify that `rocm6` builds CuPy
from source
- Update Python version references to 3.10

**`python/csrc/CMakeLists.txt`**, **`python/test/CMakeLists.txt`**
- Update `find_package(Python)` from 3.8 to 3.10

### Notes
- The `requirements_*.txt` files are kept for Docker base image builds
where only dependencies (not the project itself) should be installed.
- CuPy is intentionally not in base dependencies — users must specify a
platform extra to get the correct pre-built wheel (or source build for
ROCm).

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 21:24:45 -07:00
Binyang Li
572028ea3d Fix nccl-test CI building for all GPU architectures (#786)
## Problem

`nccl-test.yml` was the only CI template calling `deploy.yml` without
passing `gpuArch`. Since the CI build machine has no GPU, CMake fell
back to building for **all** supported architectures (`80;90;100;120`),
unnecessarily slowing down CI builds.

## Fix

- Add `gpuArch` parameter to `nccl-test.yml` and forward it to
`deploy.yml`
- Pass `gpuArch: '80'` (A100) and `gpuArch: '90'` (H100) from
`nccl-api-test.yml`

All other templates were already passing `gpuArch` correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-15 12:55:40 -07:00
Binyang Li
ecd33722d4 Fix multi-node H100 CI: CUDA compat, deploy improvements (#781)
## Summary

- **Multi-node H100 CI setup**: Improve architecture detection and GPU
configuration
- **Remove hardcoded VMSS hostnames** from deploy files
- **Fix CUDA compat library issue**: Remove stale compat paths from
Docker image for CUDA 12+. Instead, `peer_access_test` now returns a
distinct exit code (2) for CUDA init failure, and `setup.sh`
conditionally adds compat libs only when needed. This fixes
`cudaErrorSystemNotReady` (error 803) when the host driver is newer than
the container's compat libs.
- **Speed up deploy**: Replace recursive `parallel-scp` with
tar+scp+untar to avoid per-file SSH overhead.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-13 21:51:29 -07:00
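As a rough illustration of the "distinct exit code for CUDA init failure" idea in #781: a small probe binary can separate "CUDA could not initialize" from "peer access unavailable", so a wrapper script such as setup.sh can add compat libraries only when they are actually needed. This is a sketch under assumed conventions (only exit code 2 is taken from the PR text), not the mscclpp `peer_access_test`.

```cpp
// peer_access_probe.cu -- illustrative sketch, not the mscclpp test binary.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int deviceCount = 0;
  cudaError_t err = cudaGetDeviceCount(&deviceCount);
  if (err != cudaSuccess) {
    // e.g. cudaErrorSystemNotReady (803) when the driver/compat setup is broken
    std::fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
    return 2;  // distinct exit code: caller may retry after adding compat libs
  }
  if (deviceCount < 2) return 0;  // nothing to check with a single GPU
  int canAccess = 0;
  if (cudaDeviceCanAccessPeer(&canAccess, 0, 1) != cudaSuccess || canAccess == 0) {
    std::fprintf(stderr, "peer access between GPU 0 and GPU 1 unavailable\n");
    return 1;
  }
  return 0;
}
```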
Caio Rocha
b6d0ca13ca Adding CI Test to DSL Executor (#782) 2026-04-13 13:55:45 -07:00
Caio Rocha
b59e6d7f00 Updating NpKit (#785) 2026-04-13 13:36:42 -07:00
Binyang Li
5380a4ac6e Add MSCCLPP_IB_GID_INDEX env (#780)
Use MSCCLPP_IB_GID_INDEX to control the IB GID index

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-13 09:59:42 -07:00
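A minimal sketch of the kind of environment override #780 describes: read MSCCLPP_IB_GID_INDEX and fall back to a default when it is unset or malformed. The helper name and fallback behavior are assumptions for illustration, not the actual mscclpp code.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: resolve the IB GID index from MSCCLPP_IB_GID_INDEX,
// falling back to `defaultIndex` when the variable is unset or not a number.
static int resolveIbGidIndex(int defaultIndex) {
  const char* env = std::getenv("MSCCLPP_IB_GID_INDEX");
  if (env == nullptr || *env == '\0') return defaultIndex;
  try {
    return std::stoi(env);
  } catch (const std::exception&) {
    return defaultIndex;  // malformed value: keep the default rather than abort
  }
}
```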
Caio Rocha
feda338595 Adjusting Torch Integration Example (#779)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-04-10 13:57:14 -07:00
Changho Hwang
d63f9403c0 IB host-no-atomic: GDRCopy + mlx5dv Data Direct for memory-consistent low-latency signaling (#753)
Major enhancements to the IB signal forwarding mechanisms
(`host-no-atomic` mode), primarily adding support for GDRCopy and MLX5
Direct Verbs and refactoring the signal forwarding path for that mode.
The changes fix memory consistency issues and reduce signaling latency.
- GDRCopy and MLX5 Direct Verbs MR integration
- Signal forwarding path redesign
- Semaphore and connection API updates
- Environment (`MSCCLPP_FORCE_DISABLE_GDR`) and documentation updates
2026-04-09 09:24:30 +00:00
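For readers unfamiliar with GDRCopy, the core idea behind the low-latency host-side signaling in #753 is that the host maps GPU memory through the gdrcopy driver and writes signal values with plain CPU stores instead of `cudaMemcpy`. The sketch below is generic GDRCopy usage, not the mscclpp signal-forwarding path; error handling, buffer alignment requirements, and the mlx5dv Data Direct parts are all glossed over.

```cpp
// Generic GDRCopy usage sketch (not the mscclpp implementation); link with -lgdrapi -lcudart.
#include <cstdint>
#include <cuda_runtime.h>
#include <gdrapi.h>

int main() {
  const size_t size = GPU_PAGE_SIZE;  // gdrcopy pins page-sized, page-aligned regions
  void* dptr = nullptr;
  cudaMalloc(&dptr, size);

  gdr_t g = gdr_open();  // open the gdrcopy driver
  gdr_mh_t mh;
  gdr_pin_buffer(g, reinterpret_cast<unsigned long>(dptr), size, 0, 0, &mh);
  void* map = nullptr;
  gdr_map(g, mh, &map, size);  // CPU-visible mapping of the GPU buffer

  uint64_t signal = 1;
  gdr_copy_to_mapping(mh, map, &signal, sizeof(signal));  // low-latency store from the host

  gdr_unmap(g, mh, map, size);
  gdr_unpin_buffer(g, mh);
  gdr_close(g);
  cudaFree(dptr);
  return 0;
}
```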
Caio Rocha
a7273047e9 Fix TBG on DSL Get Operation (#778) 2026-04-08 17:02:07 -07:00
Caio Rocha
3e5c41c98a Adding Channel Type in ReduceSend Operation on DSL (#777)
The reduce send operation in the DSL essentially combines the reduce and put
operations. The put operation carries the channel-type information, whereas
previously we were using the channel type from the reduce operation.
2026-04-08 16:59:08 -07:00
Qinghua Zhou
ed565ceb33 Fix missing directory of document for new tag v0.9.0 (#776)
The v0.9.0 conf.py (introduced in #775) dynamically loads the version
from python/mscclpp/_version.py.

This file is generated at build time by setuptools_scm and is listed in
.gitignore — it is never committed to the repo. Earlier tags (v0.8.0 and
below) used a hardcoded release (e.g., "v0.8.0") in conf.py, so they had
no dependency on generated files.
sphinx-multiversion checks out each tag using git archive, which only
extracts committed files.
Since _version.py is not committed, the v0.9.0 checkout is missing it,
and conf.py crashes on import. All future tags will have this same
problem.

**Three changes:**
1. docs/build_multiversion.py (new): A wrapper around
sphinx-multiversion that monkey-patches copy_tree to generate
_version.py in each tag checkout after extraction. The version string is
parsed from the tag name (e.g., v0.9.0 → __version__ = "0.9.0").
2. Makefile: The multiversion target now calls build_multiversion.py
instead of sphinx-multiversion directly.
3. conf.py: Added a fallback so that if _version.py doesn't exist, it
reads the version from the VERSION file instead. This makes conf.py
resilient for any future scenario where _version.py is missing.

**Testing**
Verified locally:
• make multiversion now successfully builds all 11 versions (v0.4.0 through v0.9.0)
• v0.9.0 docs are correctly generated under _build/html/v0.9.0/
• Version selector shows v0.9.0 as latest
2026-04-08 17:59:05 -04:00
empyreus
14f75d8e76 fix pool 2026-04-08 19:00:16 +00:00
empyreus
e1687e885f update to h100 multinode 2026-04-08 18:55:38 +00:00
Binyang Li
8896cd909a Add ROCm FP8 E4M3B15 support (#774)
## Summary

Add ROCm (gfx942) support for the FP8 E4M3B15 data type, including
optimized conversion routines between FP8 E4M3B15 and FP16/FP32 using
inline assembly.

Extends the allpair packet and fullmesh allreduce kernels to support
higher-precision accumulation (e.g., FP16/FP32) when reducing FP8 data,
improving numerical accuracy.

Adds Python tests to verify that higher-precision accumulation is at
least as accurate as native FP8 accumulation across all algorithm
variants.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-08 09:53:45 -07:00
Mahdieh Ghazi
e66ce39647 Mahdieh/update version number (#775)
Update the version number for v0.9.0
2026-04-08 12:38:56 -04:00
Binyang Li
96a72bbd3e Support E4M3B15 datatype (#765)
## Summary

- **Add `fp8_e4m3b15` datatype**: A software-defined FP8 type with 4
exponent bits, 3 mantissa bits, and bias=15 (max finite value: 0.9375).
Implemented entirely in software with no HW dependency, using
Triton-style bit manipulation through fp16 as intermediate for efficient
conversion.
- **Add mixed-precision accumulation for allreduce**: All allreduce
algorithm variants (packet, NVLS packet, fullmesh, RSAG zero-copy, and
others) now support a configurable `accumDtype` parameter, enabling FP8
inputs to be reduced in float16 or float32 for higher accuracy.
- **Propagate `accumDtype` through the full API**: The new parameter is
threaded from `Algorithm::execute()` → `NativeAlgorithm` → `KernelFunc`
→ dispatch → CUDA kernels, with `DataType::AUTO` as the default
(resolves to input dtype at runtime).
- **Add FP8 accumulation correctness tests**: New `test_fp8_accum.py`
validates that higher-precision accumulation produces results at least
as accurate as native FP8 accumulation across multiple algorithms and
sizes. Skipped on CUDA SM < 89 (pre-Hopper); runs on HIP/ROCm.
- **Add `test_fp8_accum.py` to CI**: Azure Pipeline `ut.yml` now runs
FP8 accumulation tests alongside existing pytests.
- **NCCL shim logging cleanup**: Migrated `printf`-style `WARN`/`INFO`
calls to streaming-style logging.

## Key files

| Area | Files |
|------|-------|
| New datatype + vector ops | `include/mscclpp/gpu_data_types.hpp` |
| Accumulation reduce helpers | `src/core/include/reduce_kernel.hpp` |
| Algorithm API (`accumDtype`) | `include/mscclpp/algorithm.hpp`, `src/core/algorithm.cc` |
| Allreduce kernels | `src/ext/collectives/allreduce/*.cu` |
| Dispatch + common | `src/ext/collectives/include/allreduce/common.hpp` |
| Python bindings | `python/csrc/algorithm.cpp`, `python/mscclpp/_core/algorithm.py` |
| Tests | `python/test/test_fp8_accum.py` |
| CI | `.azure-pipelines/templates/ut.yml` |

## Test plan

- [x] CI passes on H100 (CUDA SM 90) — full FP8 E4M3 + E4M3B15
accumulation tests
- [x] CI passes on A100 (CUDA SM 80) — FP8 tests correctly skipped
- [x] CI passes on MI300X (ROCm) — FP8 tests run via HIP
- [x] Existing `test_mscclpp.py` tests continue to pass
- [x] NCCL shim builds and runs correctly with new `accumDtype` defaults

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 13:37:02 -07:00
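To make the "bit manipulation through fp16" idea in #765 concrete: because the e4m3b15 format and IEEE fp16 share an exponent bias of 15, an fp8 value widens to fp16 by shifting the sign, exponent, and mantissa fields into place. The sketch below is illustrative only; it ignores the reserved top exponent code and saturation on the narrowing path, and it is not the mscclpp implementation.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative e4m3b15 -> fp16 widening (not the mscclpp code).
// e4m3b15 layout: 1 sign | 4 exponent (bias 15) | 3 mantissa.
// fp16 layout:    1 sign | 5 exponent (bias 15) | 10 mantissa.
// With matching biases, widening is a pure field shift; special values are ignored here.
static uint16_t e4m3b15_to_half_bits(uint8_t x) {
  uint16_t sign = static_cast<uint16_t>(x & 0x80) << 8;     // sign bit to position 15
  uint16_t expMant = static_cast<uint16_t>(x & 0x7F) << 7;  // exponent + mantissa widened in place
  return sign | expMant;
}

int main() {
  // 0b0'1110'111: exponent 14 (unbiased -1), mantissa 1.875 -> 0.9375,
  // the max finite value quoted for this format.
  uint8_t maxFinite = 0x77;
  std::printf("fp16 bit pattern: 0x%04x\n", e4m3b15_to_half_bits(maxFinite));
  return 0;
}
```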
empyreus
1ad0f1c9d5 hostentries 2026-04-07 19:16:09 +00:00
empyreus
4f677b64c9 host entries 2026-04-07 19:15:11 +00:00
empyreus
512416edc2 add resourcegroup 2026-04-07 19:13:55 +00:00
empyreus
a1bc727e51 fix deploy step 2026-04-07 19:11:22 +00:00
empyreus
1fbcbfdec7 fix formatting 2026-04-07 19:06:56 +00:00
empyreus
d8e1de7a7f fix formatting 2026-04-07 17:31:14 +00:00
empyreus
0bf599837d try multi-pipeline 2026-04-07 17:29:47 +00:00
empyreus
8fb751470b add multi node 2026-04-07 17:15:05 +00:00
Binyang Li
fa95e82e18 Fix CI/CD pipeline issues (#773)
This pull request updates the deployment pipeline to allow custom CMake
arguments to be passed to the pip install process on remote VMs.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-07 08:41:51 -07:00
empyreus
88e1ac71c7 fix paths 2026-04-06 20:53:39 +00:00
empyreus
ea97444a8d change path 2026-04-06 19:44:14 +00:00
empyreus
68cf67d24e unit test 2026-04-06 18:12:46 +00:00
empyreus
58c5234243 ignore version mismatch 2026-04-03 23:20:11 +00:00
empyreus
e68125f270 change to h100 machine 2026-04-03 22:20:22 +00:00
empyreus
e8266a1794 running on a100 2026-04-03 15:08:12 +00:00
empyreus
53d6f76a24 simplify container 2026-04-03 14:07:46 +00:00
empyreus
7b03ece609 add prints 2026-04-03 14:02:44 +00:00
empyreus
10648a42c5 add --privileged 2026-04-02 20:35:15 +00:00
empyreus
149be8e828 fix - 2026-04-02 19:47:30 +00:00
empyreus
376a6a299d remove build 2026-04-02 19:27:09 +00:00
empyreus
61e0540cbc update for new cmake 2026-04-02 17:55:58 +00:00
empyreus
6fd8b18e83 change cmake version 2026-04-02 17:48:43 +00:00