Support E4M3B15 datatype (#765)
## Summary

- **Add `fp8_e4m3b15` datatype**: A software-defined FP8 type with 4 exponent bits, 3 mantissa bits, and bias 15 (max finite value: 0.9375). Implemented entirely in software with no hardware dependency, using Triton-style bit manipulation through fp16 as an intermediate for efficient conversion (see the conversion sketch below).
- **Add mixed-precision accumulation for allreduce**: All allreduce algorithm variants (packet, NVLS packet, fullmesh, RSAG zero-copy, and others) now support a configurable `accumDtype` parameter, enabling FP8 inputs to be reduced in float16 or float32 for higher accuracy (a simulated comparison follows the diff below).
- **Propagate `accumDtype` through the full API**: The new parameter is threaded from `Algorithm::execute()` → `NativeAlgorithm` → `KernelFunc` → dispatch → CUDA kernels, with `DataType::AUTO` as the default (resolved to the input dtype at runtime).
- **Add FP8 accumulation correctness tests**: New `test_fp8_accum.py` validates that higher-precision accumulation produces results at least as accurate as native FP8 accumulation across multiple algorithms and sizes. Skipped on CUDA SM < 89 (pre-Ada); runs on HIP/ROCm.
- **Add `test_fp8_accum.py` to CI**: Azure Pipeline `ut.yml` now runs the FP8 accumulation tests alongside the existing pytests.
- **NCCL shim logging cleanup**: Migrated `printf`-style `WARN`/`INFO` calls to streaming-style logging.

## Key files

| Area | Files |
|------|-------|
| New datatype + vector ops | `include/mscclpp/gpu_data_types.hpp` |
| Accumulation reduce helpers | `src/core/include/reduce_kernel.hpp` |
| Algorithm API (`accumDtype`) | `include/mscclpp/algorithm.hpp`, `src/core/algorithm.cc` |
| Allreduce kernels | `src/ext/collectives/allreduce/*.cu` |
| Dispatch + common | `src/ext/collectives/include/allreduce/common.hpp` |
| Python bindings | `python/csrc/algorithm.cpp`, `python/mscclpp/_core/algorithm.py` |
| Tests | `python/test/test_fp8_accum.py` |
| CI | `.azure-pipelines/templates/ut.yml` |

## Test plan

- [x] CI passes on H100 (CUDA SM 90) — full FP8 E4M3 + E4M3B15 accumulation tests
- [x] CI passes on A100 (CUDA SM 80) — FP8 tests correctly skipped
- [x] CI passes on MI300X (ROCm) — FP8 tests run via HIP
- [x] Existing `test_mscclpp.py` tests continue to pass
- [x] NCCL shim builds and runs correctly with new `accumDtype` defaults

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
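The bias of 15 is what makes the fp16 round-trip cheap: fp16 also uses exponent bias 15, so an `fp8_e4m3b15` payload widens to fp16 with a plain 7-bit shift, and narrowing only has to round off 7 mantissa bits. Below is a minimal NumPy sketch of that conversion; the helper names and the saturating, NaN-free narrowing behavior are illustrative assumptions for this sketch, not the API of `gpu_data_types.hpp`.

```python
import numpy as np

# fp8_e4m3b15 layout (per the PR): 1 sign | 4 exponent | 3 mantissa, bias 15.
# fp16 is 1 sign | 5 exponent | 10 mantissa, also bias 15, so the fp8
# exponent+mantissa bits map onto fp16 bits [13:7] unchanged.

E4M3B15_MAX_BITS = 0x77  # 0 1110 111 -> 1.875 * 2^-1 = 0.9375 (max finite value)

def e4m3b15_to_fp16(b: np.ndarray) -> np.ndarray:
    """Exact widening: sign to fp16 bit 15, exponent+mantissa to bits [13:7]."""
    b = b.astype(np.uint16)
    h = ((b & 0x80) << 8) | ((b & 0x7F) << 7)
    return h.view(np.float16)

def fp16_to_e4m3b15(x: np.ndarray) -> np.ndarray:
    """Narrowing with round-to-nearest-even on the 7 dropped bits.
    Saturates to +/-0.9375; NaN/Inf handling omitted for brevity."""
    h = np.asarray(x, dtype=np.float16).view(np.uint16)
    sign = ((h >> 8) & 0x80).astype(np.uint8)
    mag = (h & 0x7FFF).astype(np.uint32)
    # add half-ulp plus the kept LSB so ties break to even, then shift
    rounded = (mag + 0x3F + ((mag >> 7) & 1)) >> 7
    return sign | np.minimum(rounded, E4M3B15_MAX_BITS).astype(np.uint8)

# Round-trip check: every finite e4m3b15 bit pattern survives fp16 and back.
codes = np.arange(256, dtype=np.uint8)
finite = codes[(codes & 0x7F) <= E4M3B15_MAX_BITS]
assert np.array_equal(fp16_to_e4m3b15(e4m3b15_to_fp16(finite)), finite)
```

Both directions are integer bit manipulation plus ordinary fp16 arithmetic, which is what the summary means by "no hardware dependency".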
The corresponding change to the Python binding, `python/mscclpp/_core/algorithm.py`:

```diff
@@ -177,6 +177,7 @@ class Algorithm:
         nthreads_per_block=0,
         symmetric_memory: bool = False,
         extras: Optional[Dict[str, int]] = None,
+        accum_dtype: Optional[CppDataType] = None,
     ) -> int:
         """Execute the collective algorithm.
 
@@ -194,10 +195,14 @@ class Algorithm:
             nthreads_per_block: Number of threads per block (0 for auto-selection).
             symmetric_memory: Whether to use symmetric memory optimization (default: False).
             extras: Additional algorithm-specific parameters.
+            accum_dtype: Data type for accumulation during reduction. If None, defaults to
+                the same as dtype. Use DataType.float32 for high-precision FP8 accumulation.
 
         Returns:
             The result code (0 for success).
         """
+        merged_extras = dict(extras) if extras is not None else {}
+        accum_dtype = accum_dtype if accum_dtype is not None else dtype
         return self._algorithm.execute(
             comm,
             int(input_buffer),
@@ -211,7 +216,8 @@ class Algorithm:
             nblocks,
             nthreads_per_block,
             symmetric_memory,
-            extras if extras is not None else {},
+            merged_extras,
+            int(accum_dtype),
         )
 
     def reset(self):
```
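To see why the `accum_dtype` plumbed through above matters for FP8 inputs, the sketch below simulates an 8-rank allreduce two ways: re-quantizing to `fp8_e4m3b15` after every pairwise add (what a reduction whose accumulator is the input dtype does), versus accumulating in float32 and quantizing once at the end. It reuses the conversion helpers from the earlier sketch and is an illustration of the numerics, not the kernel code.

```python
import numpy as np  # plus e4m3b15_to_fp16 / fp16_to_e4m3b15 from the sketch above

rng = np.random.default_rng(0)
nranks, n = 8, 4096

# One fp8_e4m3b15 vector per rank, small enough that the 8-way sum stays
# inside the +/-0.9375 representable range.
inputs = np.stack([
    fp16_to_e4m3b15(rng.uniform(-0.1, 0.1, n).astype(np.float16))
    for _ in range(nranks)
])
exact = e4m3b15_to_fp16(inputs).astype(np.float64).sum(axis=0)

# Path 1, native accumulation: the running sum is re-quantized to fp8
# after every add, so low-order bits are discarded at each step.
acc = inputs[0]
for r in range(1, nranks):
    acc = fp16_to_e4m3b15(e4m3b15_to_fp16(acc) + e4m3b15_to_fp16(inputs[r]))
native = e4m3b15_to_fp16(acc).astype(np.float64)

# Path 2, accum_dtype=float32: sum in full precision, quantize once.
total = e4m3b15_to_fp16(inputs).astype(np.float32).sum(axis=0)
mixed = e4m3b15_to_fp16(fp16_to_e4m3b15(total.astype(np.float16))).astype(np.float64)

print("native fp8 accumulation RMSE:", np.sqrt(np.mean((native - exact) ** 2)))
print("float32 accumulation RMSE:  ", np.sqrt(np.mean((mixed - exact) ** 2)))
```

With the default `DataType.AUTO` the accumulator resolves to the input dtype, i.e. the first path; passing `accum_dtype=DataType.float32` to `execute()` selects the second.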