mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-13 01:36:10 +00:00
I recently encountered a weird memory usage issue. After starting the proxy service on a CUDA device X > 0, I noticed an unexpected thread appear on both GPU X and GPU 0, where GPU 0's share is about 500 MB. Note that when the device is 0, there is no extra memory usage. The image clearly shows that when 8 ranks each use one GPU and start proxies, GPU 0 sees 7 extra threads, each consuming about 500 MB of extra memory.

<img width="1247" height="1367" alt="Screenshot 2026-02-28 000153" src="https://github.com/user-attachments/assets/cfd0d47f-319b-4ebb-bf19-dec66062e6f4" />

After tracking down when it happens, I identified the root cause in the proxy thread initialization:

```cpp
// never capture in a proxy thread
auto mode = cudaStreamCaptureModeRelaxed;
MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));
pimpl_->threadInit();
```

The call to `cudaThreadExchangeStreamCaptureMode()` actually triggers some resource allocation on the "current device", which is still 0 for the starting thread. The later `threadInit()` is too late to set the correct GPU number. The fix is simple: call `threadInit()` before the first CUDA call:

```cpp
pimpl_->threadInit();
// never capture in a proxy thread
auto mode = cudaStreamCaptureModeRelaxed;
MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));
```

This guarantees that the current device is properly set before calling any resource-allocating CUDA functions. This is the memory usage after the fix; the extra memory usage is gone.

<img width="1242" height="459" alt="Image (1)" src="https://github.com/user-attachments/assets/4256e4c8-6f1d-4844-9f77-5b2935387df9" />

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
64 lines
1.6 KiB
C++
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT license.

#ifndef MSCCLPP_PROXY_HPP_
#define MSCCLPP_PROXY_HPP_

#include <functional>
#include <memory>

#include "fifo.hpp"

namespace mscclpp {

/// Return values for ProxyHandler.
enum class ProxyHandlerResult {
  /// Move to next trigger in FIFO.
  Continue,
  /// Stop and exit proxy.
  Stop,
};

class Proxy;

/// Handler function type for proxy.
using ProxyHandler = std::function<ProxyHandlerResult(ProxyTrigger)>;

/// Host-side proxy for PortChannels.
class Proxy {
 public:
  /// Constructor.
  /// @param handler Handler for each FIFO trigger.
  /// @param threadInit Optional function run once in the proxy thread before FIFO consumption.
  /// The function should initialize thread runtime context before any CUDA API call in that thread
  /// (for example, set CUDA device and optionally bind NUMA affinity).
  /// @param fifoSize FIFO size (default: DEFAULT_FIFO_SIZE).
  Proxy(ProxyHandler handler, std::function<void()> threadInit, int fifoSize = DEFAULT_FIFO_SIZE);

  /// Constructor.
  /// @param handler Handler for each FIFO trigger.
  /// @param fifoSize FIFO size (default: DEFAULT_FIFO_SIZE).
  Proxy(ProxyHandler handler, int fifoSize = DEFAULT_FIFO_SIZE);

  /// Destructor. Stops proxy if running.
  ~Proxy();

  /// Start proxy.
  void start(bool blocking = false);

  /// Stop proxy.
  void stop();

  /// Get reference to FIFO used by proxy.
  /// @return Shared pointer to FIFO.
  std::shared_ptr<Fifo> fifo();

 private:
  struct Impl;
  std::unique_ptr<Impl> pimpl_;
};

}  // namespace mscclpp

#endif  // MSCCLPP_PROXY_HPP_
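For context, a caller-side sketch of the `threadInit` parameter documented above. This is not an official example from the repository: the include path, `gpuId`, and the handler body are assumptions for illustration, and the fragment needs the mscclpp library and a CUDA runtime to build.

```cpp
#include <cuda_runtime.h>

#include <mscclpp/proxy.hpp>  // assumed install path for this header

// Sketch: run a proxy whose worker thread is pinned to a given GPU.
void runProxyOn(int gpuId) {
  mscclpp::Proxy proxy(
      [](mscclpp::ProxyTrigger /*trigger*/) {
        // Handle one FIFO trigger; Continue keeps the proxy loop running.
        return mscclpp::ProxyHandlerResult::Continue;
      },
      [gpuId]() {
        // threadInit: runs once inside the proxy thread, before any other
        // CUDA call there, so per-thread resources land on gpuId rather
        // than on device 0.
        cudaSetDevice(gpuId);
      });
  proxy.start();
  // ... push triggers into proxy.fifo() from device code ...
  proxy.stop();
}
```

Passing the device-setting work through `threadInit` (instead of calling `cudaSetDevice` before constructing the Proxy) is what makes the fix effective, since the device selection must happen on the proxy thread itself.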