## Support Python wheel build
This PR modernizes the Python packaging for MSCCL++ by defining
dependencies and optional extras in `pyproject.toml`, enabling proper
wheel builds with `pip install ".[cuda12]"`.
### Changes
**`pyproject.toml`**
- Add `dependencies` (numpy, blake3, pybind11, sortedcontainers)
- Add `optional-dependencies` for platform-specific CuPy (`cuda11`,
`cuda12`, `cuda13`, `rocm6`), `benchmark`, and `test` extras
- Bump minimum Python version from 3.8 to 3.10
**`test/deploy/setup.sh`**
- Use `pip install ".[<platform>,benchmark,test]"` instead of separate
`pip install -r requirements_*.txt` + `pip install .` steps
- Add missing CUDA 13 case
**`docs/quickstart.md`**
- Update install instructions to use extras (e.g., `pip install
".[cuda12]"`)
- Document all available extras and clarify that `rocm6` builds CuPy
from source
- Update Python version references to 3.10
**`python/csrc/CMakeLists.txt`**, **`python/test/CMakeLists.txt`**
- Update `find_package(Python)` from 3.8 to 3.10
### Notes
- The `requirements_*.txt` files are kept for Docker base image builds
where only dependencies (not the project itself) should be installed.
- CuPy is intentionally not in base dependencies — users must specify a
platform extra to get the correct pre-built wheel (or source build for
ROCm).
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Provides two integration ways for MSCCL++ DSL.
1. Integrate with customized communication group
2. Integrate with NCCL API
Introduce new Python APIs to make it work:
```python
mscclpp.compile # compile dsl to json based execution plan
mscclpp.ExecutionPlanRegistry.register_plan(plan) # register the compiled plan to executionPlanRegistery
mscclpp.ExecutionPlanRegistry.set_selector(selector) # set the selector, the selector will return the best execution plan based on collection, message size, world size....
```
Fix#556
---------
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
The PR contains following changes:
Python side:
- Channel based DSL implementation: decouple channel with chunk.
- Users create channel explicitly, only need local_rank, remote_rank and
channel_type
- Adjust executor json file, add remote_buffer fields, different op can
use different channel and remote buffers combination.
- Reimplement operation fusion, data dependency check mechanism
- Add new op such as semaphore, pipeline
- Clean code and enhance document
C++ side:
- Support new execution file json format
- Support semaphore and pipeline operation
- code clean, support non-zero copy scenario
---------
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>