I recently encountered a weird memory usage issue. After starting the proxy service on CUDA device X > 0, I noticed an unexpected thread entry appear on both GPU X and GPU 0, with GPU 0's share being about 500 MB. Note that when the device is 0, there is no extra memory usage. The image clearly shows that when 8 ranks each use one GPU and start proxies, GPU 0 sees 7 extra threads, each consuming about 500 MB of extra memory.

<img width="1247" height="1367" alt="Screenshot 2026-02-28 000153" src="https://github.com/user-attachments/assets/cfd0d47f-319b-4ebb-bf19-dec66062e6f4" />

After tracking down when it happens, I identified the root cause in the proxy thread initialization:

```cpp
// never capture in a proxy thread
auto mode = cudaStreamCaptureModeRelaxed;
MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));

pimpl_->threadInit();
```

The call to `cudaThreadExchangeStreamCaptureMode()` actually triggers resource allocation on the "current device", which is still 0 for the newly started thread. The later `threadInit()` is too late to set the correct GPU number.

The fix is simple: call `threadInit()` before the first CUDA call:

```cpp
pimpl_->threadInit();

// never capture in a proxy thread
auto mode = cudaStreamCaptureModeRelaxed;
MSCCLPP_CUDATHROW(cudaThreadExchangeStreamCaptureMode(&mode));
```

This guarantees that the current device is properly set before calling any resource-allocating CUDA functions.

This is the memory usage after the fix; the extra memory usages are gone.

<img width="1242" height="459" alt="Image (1)" src="https://github.com/user-attachments/assets/4256e4c8-6f1d-4844-9f77-5b2935387df9" />

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>