core: TODO notes on CUDA-IPC atomicAdd context/flush caveats

Annotate the two known issues flagged by Copilot review on PR #796:

- atomicadd_kernel.cu: launching the atomicAdd kernel from a separate
  CUDA context while `dst` is a CUDA-IPC mapping registered in the
  primary context is technically undefined behavior; it works in
  practice on current drivers but should be revisited.
- context.cc: `CudaIpcStream::sync()` deliberately skips
  `proxyAtomicStream_` to avoid deadlocking the proxy thread, with
  the side effect that `Connection::flush()` does not order pending
  remote atomicAdd ops on the CUDA-IPC transport.

Both behaviors were cherry-picked from DeepEP branch
`chhwang/dev-atomic-add-cleanup` and should be revisited before this
lands on `main`.
Author: Qinghua Zhou
Date:   2026-05-06 03:44:10 +00:00
parent  23e8ce6dbe
commit  01032fa167
2 changed files with 16 additions and 0 deletions

atomicadd_kernel.cu

@@ -24,6 +24,14 @@ void CudaIpcStream::atomicAdd(uint64_t* dst, int64_t value) {
// The CUDA runtime uses a per-context lock; the main thread holds it while waiting for
// the test kernel, and the proxy thread needs it to launch the atomicAdd kernel.
// A separate CUDA context avoids this contention.
//
// TODO(#796): `dst` is a CUDA-IPC mapping registered in the primary/runtime context, so
// launching this kernel from `proxyAtomicCtx_` is technically UB (device pointers are
// context-scoped). It works in practice on current drivers because the IPC handle aliases
// the same physical allocation, but a correct fix would either (a) avoid the separate
// context (e.g. break the deadlock differently) or (b) re-open the IPC mapping inside
// `proxyAtomicCtx_`. Carried over from the DeepEP `chhwang/dev-atomic-add-cleanup`
// cherry-pick; revisit before this lands on `main`.
if (!proxyAtomicCtx_) {
CUdevice cuDevice;
CUresult res = cuDeviceGet(&cuDevice, deviceId_);

context.cc

@@ -63,6 +63,14 @@ void CudaIpcStream::sync() {
// operations that complete asynchronously on the GPU. Syncing them here would deadlock
// because sync() is called from the proxy thread while the main thread may hold the
// device context via cudaStreamSynchronize() on the test kernel's stream.
//
// TODO(#796): As a side effect, `Connection::flush()` does not order/complete pending
// remote `atomicAdd` operations on the CUDA-IPC transport, so PortChannel flush no
// longer guarantees that a peer kernel sees the updated value. EP currently relies on
// higher-level signaling (PortChannel signal/wait, FIFO drain) for ordering, but a
// correct fix needs a deadlock-free way to drain `proxyAtomicStream_` here. Carried
// over from the DeepEP `chhwang/dev-atomic-add-cleanup` cherry-pick; revisit before
// this lands on `main`.
}
IbCtx* Context::Impl::getIbContext(Transport ibTransport) {