mirror of https://github.com/microsoft/mscclpp.git, synced 2026-05-11 17:00:22 +00:00
ext/ep: add pragma once to event.hpp and update validation docs

- Add #pragma once to src/ext/ep/event.hpp; including it more than once in a single translation unit would otherwise redefine EventHandle.
- python/mscclpp/ext/ep/buffer.py: low-latency internode is now validated on 2x H100x8; remove the 'untested on multi-node H100' note.
- src/ext/ep/kernels/internode_ll.cu: replace the untested-on-multi-node WARNING with the current validated-on-2x-H100x8 status.

Addresses Copilot review comments on PR #796.
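The effect of the header guard can be sketched in a single file. This is a minimal illustration, not the contents of event.hpp: `EventHandleDemo`, `EVENT_DEMO_HPP_`, `round_trip_id`, and `check_demo` are hypothetical names, and a classic `#ifndef` include guard stands in for `#pragma once`, which only works in a separate header file. The block pastes the "header" body twice, roughly what the preprocessor sees when two headers both include the same file within one translation unit.

```cpp
#include <cassert>

// --- first inclusion of a hypothetical event_demo.hpp ---
#ifndef EVENT_DEMO_HPP_
#define EVENT_DEMO_HPP_
// Hypothetical stand-in; the real EventHandle is not reproduced here.
struct EventHandleDemo {
  int id;
};
#endif  // EVENT_DEMO_HPP_

// --- second inclusion of the same header, e.g. via another header ---
#ifndef EVENT_DEMO_HPP_  // already defined, so this copy is skipped
#define EVENT_DEMO_HPP_
struct EventHandleDemo {
  int id;
};
#endif  // EVENT_DEMO_HPP_

// Builds a demo handle and reads the id back, showing the type is usable.
inline int round_trip_id(int id) {
  EventHandleDemo handle{id};
  return handle.id;
}

// Small self-check that can be called from any main().
inline void check_demo() {
  assert(round_trip_id(7) == 7);
}
```

Deleting the guard lines around the second copy should make the compiler reject the file with a redefinition error for `EventHandleDemo`. That is the failure mode a guard such as `#pragma once` prevents.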
--- a/python/mscclpp/ext/ep/buffer.py
+++ b/python/mscclpp/ext/ep/buffer.py
@@ -17,8 +17,9 @@ Current status (see ``src/ext/ep/README.md``):
 * Internode HT (MSCCL++ PortChannel + MemoryChannel) dispatch and combine:
 ported and validated on 2 nodes x 8 H100 GPUs with
 ``test/python/ext/ep/test_internode_multirank.py``.
-* Internode low-latency kernels: structural port (NVSHMEM/IBGDA ->
-MSCCL++ PortChannel), **untested on multi-node H100**.
+* Internode low-latency kernels (NVSHMEM/IBGDA -> MSCCL++ PortChannel):
+ported and validated on 2 nodes x 8 H100 GPUs with
+``test/python/ext/ep/test_low_latency_multirank.py``.
 """
 
 from __future__ import annotations
--- a/src/ext/ep/event.hpp
+++ b/src/ext/ep/event.hpp
@@ -1,5 +1,6 @@
 // Copyright (c) Microsoft Corporation.
 // Licensed under the MIT License.
+#pragma once
 
 #include <ATen/cuda/CUDAContext.h>
 #include <memory>
--- a/src/ext/ep/kernels/internode_ll.cu
+++ b/src/ext/ep/kernels/internode_ll.cu
@@ -22,9 +22,9 @@
 // position in the connected-peer map. In the recommended 1-GPU-per-node
 // LL topology, `peer_idx == dst_rank`; see src/ext/ep/README.md.
 //
-// WARNING: This port is untested on multi-node H100; performance will NOT
-// match IBGDA (host-proxy adds latency). Functional correctness needs
-// validation on real hardware.
+// Validated on 2 nodes x 8 H100 GPUs via
+// `test/python/ext/ep/test_low_latency_multirank.py`. Performance does NOT
+// match IBGDA (host-proxy adds latency); see README for measurements.
 
 #include "configs.cuh"
 #include "exception.cuh"