mscclpp/src/core
Binyang Li 5d18835417 Fix use-after-free for fabric allocation handle in GpuIpcMemHandle (#764)
## Summary

Fix a use-after-free where the CUDA allocation handle
(`CUmemGenericAllocationHandle`) was released prematurely while the
exported fabric handle still referenced it.

## Problem

Unlike POSIX FD handles (where the kernel keeps the allocation alive via
the open file descriptor), fabric handles do not hold their own
reference to the underlying allocation. The original code called
`cuMemRelease(allocHandle)` immediately after exporting the fabric
handle, freeing the allocation. When a remote process later calls
`cuMemImportFromShareableHandle` with that fabric handle, it references
a freed allocation, which is a **use-after-free**.

This affected both code paths:

1. **`GpuIpcMemHandle::create()`**: The local `allocHandle` obtained via
`cuMemRetainAllocationHandle` was released right after fabric export,
leaving the fabric handle dangling.
2. **`GpuIpcMemHandle::createMulticast()`**: The `allocHandle` from
`cuMulticastCreate` was unconditionally released, even when it was the
only thing keeping the multicast object alive for the fabric handle.
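The ordering problem can be sketched with a toy reference-counting model. This is not the CUDA driver API: `MockDriver`, `MockAllocHandle`, `MockFabricHandle`, and both helper functions are hypothetical names. The model also assumes, for simplicity, that the exporter holds the only reference to the allocation. It only illustrates why releasing before the remote import fails when the exported handle carries no reference of its own.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>

// Toy model only: all names are hypothetical, not CUDA or mscclpp types.
using MockAllocHandle = std::uint64_t;
struct MockFabricHandle {
  MockAllocHandle target;  // names the allocation but holds no reference to it
};

class MockDriver {
 public:
  // Create an allocation with a single reference owned by the caller.
  MockAllocHandle create() {
    MockAllocHandle h = next_++;
    refs_[h] = 1;
    return h;
  }
  // Plays the role of cuMemRetainAllocationHandle: adds a reference.
  void retain(MockAllocHandle h) { ++refs_.at(h); }
  // Plays the role of cuMemRelease: drops a reference, frees at zero.
  void release(MockAllocHandle h) {
    if (--refs_.at(h) == 0) refs_.erase(h);
  }
  // Exporting does not add a reference in this model.
  MockFabricHandle exportFabric(MockAllocHandle h) const {
    assert(refs_.count(h) != 0);
    return MockFabricHandle{h};
  }
  // Import succeeds only while the allocation is still alive.
  std::optional<MockAllocHandle> importFabric(const MockFabricHandle& f) {
    auto it = refs_.find(f.target);
    if (it == refs_.end()) return std::nullopt;
    ++it->second;  // the importer gets its own reference
    return f.target;
  }

 private:
  MockAllocHandle next_ = 1;
  std::unordered_map<MockAllocHandle, int> refs_;
};

// Old ordering: release right after export, import afterwards.
bool importAfterEarlyRelease() {
  MockDriver drv;
  MockAllocHandle h = drv.create();
  MockFabricHandle f = drv.exportFabric(h);
  drv.release(h);  // last reference dropped; allocation is gone
  return drv.importFabric(f).has_value();
}

// Fixed ordering: keep the reference until the import has happened.
bool importWhileHeld() {
  MockDriver drv;
  MockAllocHandle h = drv.create();
  MockFabricHandle f = drv.exportFabric(h);
  bool ok = drv.importFabric(f).has_value();
  drv.release(h);
  return ok;
}
```

In this model `importAfterEarlyRelease()` reports a failed import, while `importWhileHeld()` succeeds. The real driver does not necessarily fail cleanly in the first case, which is what makes the bug a use-after-free rather than a simple error return.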

## Fix

- **Added `allocHandle` field** to the `fabric` struct in
`GpuIpcMemHandle` to store the allocation handle and keep it alive for
the lifetime of the `GpuIpcMemHandle`.
- **`create()`**: Retain an additional reference via
`cuMemRetainAllocationHandle` and store it in `fabric.allocHandle` when
a fabric handle is successfully exported.
- **`createMulticast()`**: Store the `allocHandle` directly in
`fabric.allocHandle` instead of unconditionally releasing it. Release it
only when no fabric handle was exported.
- **`deleter()`**: Release `fabric.allocHandle` via `cuMemRelease` when
the handle type includes `Fabric`, ensuring proper cleanup.
- **`GpuIpcMem` constructor (importer side)**: Clear
`fabric.allocHandle` after importing, since the importer gets its own
handle via `cuMemImportFromShareableHandle` and should not release the
exporter's allocation handle.
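The ownership rule described by the bullets above can be sketched as follows. This is a minimal illustration, not the code in `gpu_ipc_mem.cc`: `ToyDriver`, `ToyIpcHandle`, `ToyIpcHandleDeleter`, and the helper functions are hypothetical. Treating a zero `allocHandle` as "nothing to release" is an assumption about how a cleared field is recognised.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-ins; not the mscclpp or CUDA types.
using ToyAllocHandle = std::uint64_t;

// Counts release calls so the behaviour can be checked.
struct ToyDriver {
  int releaseCalls = 0;
  void release(ToyAllocHandle) { ++releaseCalls; }
};

struct ToyIpcHandle {
  bool hasFabric = false;
  ToyAllocHandle allocHandle = 0;  // assumption: 0 means "nothing to release"
  ToyDriver* driver = nullptr;
};

// Releases the stored allocation handle only when a fabric handle owns one.
struct ToyIpcHandleDeleter {
  void operator()(ToyIpcHandle* h) const {
    if (h->hasFabric && h->allocHandle != 0) h->driver->release(h->allocHandle);
    delete h;
  }
};

using ToyIpcHandlePtr = std::unique_ptr<ToyIpcHandle, ToyIpcHandleDeleter>;

// Exporter side: store the retained handle for the lifetime of the object.
ToyIpcHandlePtr makeExported(ToyDriver& drv, ToyAllocHandle retained) {
  ToyIpcHandlePtr p(new ToyIpcHandle{});
  p->hasFabric = true;
  p->allocHandle = retained;
  p->driver = &drv;
  return p;
}

// Importer side: copy the received handle, then clear the exporter's field
// so the importer never releases a reference it does not own.
ToyIpcHandlePtr makeImported(ToyDriver& drv, const ToyIpcHandle& received) {
  ToyIpcHandlePtr p(new ToyIpcHandle(received));
  p->driver = &drv;
  p->allocHandle = 0;
  return p;
}

int releasesForExporter() {
  ToyDriver drv;
  { ToyIpcHandlePtr p = makeExported(drv, 42); }  // deleter runs here
  return drv.releaseCalls;
}

int releasesForImporter() {
  ToyDriver drv;
  ToyIpcHandle wire;
  wire.hasFabric = true;
  wire.allocHandle = 42;  // value that belongs to the exporter
  { ToyIpcHandlePtr p = makeImported(drv, wire); }  // deleter runs here
  return drv.releaseCalls;
}
```

Under these assumptions the exporter's handle triggers exactly one release when it is destroyed, and the importer's copy triggers none. That matches the intent stated above: one owner per retained reference.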

## Files Changed

- `src/core/include/gpu_ipc_mem.hpp` — Added
`CUmemGenericAllocationHandle allocHandle` to fabric struct.
- `src/core/gpu_ipc_mem.cc` — Retain/release allocation handle properly
across create, createMulticast, deleter, and importer paths.
2026-03-19 11:52:09 -07:00