mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-11 17:00:22 +00:00
core: TODO notes on CUDA-IPC atomicAdd context/flush caveats
Annotate the two known issues flagged by Copilot review on PR #796:

- atomicadd_kernel.cu: launching the atomicAdd kernel from a separate CUDA
  context while `dst` is a CUDA-IPC mapping registered in the primary context
  is technically UB; it works in practice on current drivers but should be
  revisited.
- context.cc: `CudaIpcStream::sync()` deliberately skips `proxyAtomicStream_`
  to avoid deadlocking the proxy thread, with the side effect that
  `Connection::flush()` does not order pending remote atomicAdd ops on the
  CUDA-IPC transport.

Both behaviors were cherry-picked from DeepEP branch
`chhwang/dev-atomic-add-cleanup` and should be revisited before this lands on
`main`.
@@ -24,6 +24,14 @@ void CudaIpcStream::atomicAdd(uint64_t* dst, int64_t value) {
   // The CUDA runtime uses a per-context lock; the main thread holds it while waiting for
   // the test kernel, and the proxy thread needs it to launch the atomicAdd kernel.
   // A separate CUDA context avoids this contention.
+  //
+  // TODO(#796): `dst` is a CUDA-IPC mapping registered in the primary/runtime context, so
+  // launching this kernel from `proxyAtomicCtx_` is technically UB (device pointers are
+  // context-scoped). It works in practice on current drivers because the IPC handle aliases
+  // the same physical allocation, but a correct fix would either (a) avoid the separate
+  // context (e.g. break the deadlock differently) or (b) re-open the IPC mapping inside
+  // `proxyAtomicCtx_`. Carried over from the DeepEP `chhwang/dev-atomic-add-cleanup`
+  // cherry-pick; revisit before this lands on `main`.
   if (!proxyAtomicCtx_) {
     CUdevice cuDevice;
     CUresult res = cuDeviceGet(&cuDevice, deviceId_);
||||
@@ -63,6 +63,14 @@ void CudaIpcStream::sync() {
   // operations that complete asynchronously on the GPU. Syncing them here would deadlock
   // because sync() is called from the proxy thread while the main thread may hold the
   // device context via cudaStreamSynchronize() on the test kernel's stream.
+  //
+  // TODO(#796): As a side effect, `Connection::flush()` does not order/complete pending
+  // remote `atomicAdd` operations on the CUDA-IPC transport, so PortChannel flush no
+  // longer guarantees that a peer kernel sees the updated value. EP currently relies on
+  // higher-level signaling (PortChannel signal/wait, FIFO drain) for ordering, but a
+  // correct fix needs a deadlock-free way to drain `proxyAtomicStream_` here. Carried
+  // over from the DeepEP `chhwang/dev-atomic-add-cleanup` cherry-pick; revisit before
+  // this lands on `main`.
 }

 IbCtx* Context::Impl::getIbContext(Transport ibTransport) {
||||