mscclpp/include/mscclpp/proxy.hpp
Xingbo Wu 69565a2f32 Do threadInit/cudaSetDevice before other cuda calls (#757)
I recently encountered an unexpected memory usage issue.
After starting the proxy service on a CUDA device X > 0, I noticed an
unexpected thread entry appear on both GPU X and GPU 0, where GPU 0's
share is about 500MB. When the device is 0, there is no extra memory
usage.
The image below shows that when 8 ranks each use one GPU and start
proxies, GPU 0 sees 7 extra threads, each consuming about 500MB of
extra memory.
<img width="1247" height="1367" alt="Screenshot 2026-02-28 000153"
src="https://github.com/user-attachments/assets/cfd0d47f-319b-4ebb-bf19-dec66062e6f4"
/>


After tracking down when it happens, I identified the root cause in
the proxy thread initialization:

    // never capture in a proxy thread
    auto mode = cudaStreamCaptureModeRelaxed;
    MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));

    pimpl_->threadInit();

The call to cudaThreadExchangeStreamCaptureMode() triggers resource
allocation on the "current device", which is still 0 for the newly
started thread.
By the time threadInit() runs and sets the correct GPU number, the
allocation on GPU 0 has already happened.

The fix is simple: call threadInit() before the first CUDA call:

    pimpl_->threadInit();
    // never capture in a proxy thread
    auto mode = cudaStreamCaptureModeRelaxed;
    MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));

This guarantees that the current device is properly set before calling
any resource-allocating CUDA functions.
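
To make the ordering issue concrete, here is a minimal toy model of it. This is a sketch under stated assumptions, not the actual proxy code: `ToyThread`, `setDevice()`, `runtimeCall()`, and `startProxyThread()` are hypothetical names that only mimic the behavior described above (a new thread starts with device 0 as its current device, and a runtime call allocates on whatever device is current). No CUDA or MSCCL++ API is used.

```cpp
#include <array>
#include <cassert>

// Toy model only: none of these names are CUDA or MSCCL++ APIs.
constexpr int kNumDevices = 8;

// State of one freshly started thread.
struct ToyThread {
  int currentDevice = 0;                    // a new thread defaults to device 0
  std::array<bool, kNumDevices> touched{};  // devices where resources were allocated

  // Stand-in for the user-supplied threadInit(), which selects the device.
  void setDevice(int device) {
    assert(device >= 0 && device < kNumDevices);
    currentDevice = device;
  }

  // Stand-in for any runtime call that allocates on the current device.
  void runtimeCall() { touched[currentDevice] = true; }
};

// Simulates proxy thread startup for `device` and reports which devices
// ended up with allocations. `initFirst` selects the fixed ordering.
std::array<bool, kNumDevices> startProxyThread(int device, bool initFirst) {
  ToyThread t;
  if (initFirst) {
    t.setDevice(device);  // fixed order: select the device first
    t.runtimeCall();
  } else {
    t.runtimeCall();      // old order: allocates on device 0
    t.setDevice(device);
  }
  t.runtimeCall();        // later proxy work runs on the intended device
  return t.touched;
}
```

In this model, `startProxyThread(3, false)` marks both device 0 and device 3 as touched, while `startProxyThread(3, true)` marks only device 3, which mirrors the before and after screenshots.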

This is the memory usage after the fix. The extra memory usage is
gone.

<img width="1242" height="459" alt="Image (1)"
src="https://github.com/user-attachments/assets/4256e4c8-6f1d-4844-9f77-5b2935387df9"
/>

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2026-03-02 15:53:59 -08:00


// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.

#ifndef MSCCLPP_PROXY_HPP_
#define MSCCLPP_PROXY_HPP_

#include <functional>
#include <memory>

#include "fifo.hpp"

namespace mscclpp {

/// Return values for ProxyHandler.
enum class ProxyHandlerResult {
  /// Move to next trigger in FIFO.
  Continue,
  /// Stop and exit proxy.
  Stop,
};

class Proxy;

/// Handler function type for proxy.
using ProxyHandler = std::function<ProxyHandlerResult(ProxyTrigger)>;

/// Host-side proxy for PortChannels.
class Proxy {
 public:
  /// Constructor.
  /// @param handler Handler for each FIFO trigger.
  /// @param threadInit Optional function run once in the proxy thread before FIFO consumption.
  /// The function should initialize thread runtime context before any CUDA API call in that thread
  /// (for example, set CUDA device and optionally bind NUMA affinity).
  /// @param fifoSize FIFO size (default: DEFAULT_FIFO_SIZE).
  Proxy(ProxyHandler handler, std::function<void()> threadInit, int fifoSize = DEFAULT_FIFO_SIZE);

  /// Constructor.
  /// @param handler Handler for each FIFO trigger.
  /// @param fifoSize FIFO size (default: DEFAULT_FIFO_SIZE).
  Proxy(ProxyHandler handler, int fifoSize = DEFAULT_FIFO_SIZE);

  /// Destructor. Stops proxy if running.
  ~Proxy();

  /// Start proxy.
  void start(bool blocking = false);

  /// Stop proxy.
  void stop();

  /// Get reference to FIFO used by proxy.
  /// @return Shared pointer to FIFO.
  std::shared_ptr<Fifo> fifo();

 private:
  struct Impl;
  std::unique_ptr<Impl> pimpl_;
};

}  // namespace mscclpp

#endif  // MSCCLPP_PROXY_HPP_