mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-13 01:36:10 +00:00

Author	SHA1	Message	Date
empyreus	d7b0dd627e	trying to rework image pull	2026-04-01 15:31:32 +00:00
empyreus	f5159b0e16	check if pipeline needs creation	2026-04-01 14:45:59 +00:00
empyreus	503647c128	fix missing quote	2026-03-31 20:38:55 +00:00
empyreus	36c496dc98	readd tests	2026-03-31 19:05:37 +00:00
empyreus	49aeea0660	fix container deletion	2026-03-31 19:05:11 +00:00
empyreus	0c8f4fd583	find directory	2026-03-31 18:55:22 +00:00
empyreus	80194b2803	fix directory	2026-03-31 16:28:35 +00:00
empyreus	48a6a2e441	add sglang all_reduce	2026-03-31 15:47:36 +00:00
empyreus	f938f60505	update sglang-test	2026-03-30 18:08:25 +00:00
empyreus	a22104c391	add remaining tests	2026-03-30 17:00:18 +00:00
empyreus	83d9301e24	full run	2026-03-27 22:20:05 +00:00
empyreus	6ac12fa1d5	comment out to fix pipeline	2026-03-27 22:16:06 +00:00
empyreus	a9d7bd8918	fix	2026-03-27 22:14:51 +00:00
empyreus	f171663d4e	fix batch size	2026-03-27 21:57:07 +00:00
empyreus	38552a6f9c	fix remote run and clean up files	2026-03-27 21:13:07 +00:00
empyreus	324254d57c	finish adding sglang steps	2026-03-27 20:28:45 +00:00
empyreus	4107fa9644	fix run remote	2026-03-27 20:05:17 +00:00
empyreus	35148991e8	fix cmake	2026-03-27 19:32:47 +00:00
empyreus	fa30289415	update for new remote run	2026-03-26 23:51:07 +00:00
empyreus	0a6d329bb8	add sshke	2026-03-26 23:37:14 +00:00
empyreus	e423ca8952	rename files	2026-03-26 23:34:42 +00:00
empyreus	8d642637e5	update template	2026-03-26 22:42:26 +00:00
empyreus	c541f27fb6	update template	2026-03-26 22:41:02 +00:00
empyreus	99f2faced2	fixes	2026-03-26 22:32:09 +00:00
empyreus	01b7af9733	sanity check	2026-03-26 22:23:11 +00:00
empyreus	57eedc915a	Merge branch 'main' into rjsouza/sglang-tests	2026-03-26 22:21:42 +00:00
empyreus	7f016cb7f0	fix venv	2026-03-26 22:01:48 +00:00
empyreus	257735cb51	setup sglang python venv	2026-03-26 21:43:56 +00:00
empyreus	8b007892df	check hostname	2026-03-26 21:31:22 +00:00
empyreus	4cf9cd721b	change cuda version	2026-03-26 20:50:32 +00:00
empyreus	12d935ff81	fix docker name	2026-03-26 20:22:51 +00:00
empyreus	d33458db8a	modify to setup sglang	2026-03-26 19:23:24 +00:00
empyreus	c883994556	add git clone msccl	2026-03-26 18:21:44 +00:00
empyreus	dd3c3ed7cb	change image	2026-03-25 22:05:18 +00:00
empyreus	ee771ec4c0	fix dockerfile	2026-03-25 21:52:09 +00:00
empyreus	fa24653d8d	update docker image	2026-03-25 21:45:29 +00:00
Copilot	93f6eeaa6b	Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744)
- Removes the GTest dependency, replacing it with a minimal custom framework (`test/framework.`) that covers only what the tests actually use: a unified `TEST()` macro with SFINAE-based fixture auto-detection, `EXPECT_`/`ASSERT_*` assertions, environments, and setup/teardown.
- `--exclude-perf-tests` flag and substring-based negative filtering
- `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to Codecov
- Merges standalone `test/perf/` into main test targets
- Refactors Azure pipelines to reduce redundancies & make more readable
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2026-03-24 23:34:38 -04:00
empyreus	b7adec0e60	create sglang docker image	2026-03-19 20:03:00 +00:00
Binyang Li	5d18835417	Fix use-after-free for fabric allocation handle in GpuIpcMemHandle (#764)
## Summary
Fix a use-after-free where the CUDA allocation handle (`CUmemGenericAllocationHandle`) was released prematurely while the exported fabric handle still referenced it.
## Problem
Unlike POSIX FD handles (where the kernel keeps the allocation alive via the open file descriptor), fabric handles do not hold their own reference to the underlying allocation. The original code called `cuMemRelease(allocHandle)` immediately after exporting the fabric handle, freeing the allocation. When a remote process later tries to `cuMemImportFromShareableHandle` using that fabric handle, it references a freed allocation, which is a use-after-free.
This affected both code paths:
1. `GpuIpcMemHandle::create()`: The local `allocHandle` obtained via `cuMemRetainAllocationHandle` was released right after fabric export, leaving the fabric handle dangling.
2. `GpuIpcMemHandle::createMulticast()`: The `allocHandle` from `cuMulticastCreate` was unconditionally released, even when it was the only thing keeping the multicast object alive for the fabric handle.
## Fix
- Added `allocHandle` field to the `fabric` struct in `GpuIpcMemHandle` to store the allocation handle and keep it alive for the lifetime of the `GpuIpcMemHandle`.
- `create()`: Retain an additional reference via `cuMemRetainAllocationHandle` and store it in `fabric.allocHandle` when a fabric handle is successfully exported.
- `createMulticast()`: Store the `allocHandle` directly in `fabric.allocHandle` instead of unconditionally releasing it. Only release if fabric export was not used.
- `deleter()`: Release `fabric.allocHandle` via `cuMemRelease` when the handle type includes `Fabric`, ensuring proper cleanup.
- `GpuIpcMem` constructor (importer side): Clear `fabric.allocHandle` after importing, since the importer gets its own handle via `cuMemImportFromShareableHandle` and should not release the exporter's allocation handle.
## Files Changed
- `src/core/include/gpu_ipc_mem.hpp`: Added `CUmemGenericAllocationHandle allocHandle` to fabric struct.
- `src/core/gpu_ipc_mem.cc`: Retain/release allocation handle properly across create, createMulticast, deleter, and importer paths.	2026-03-19 11:52:09 -07:00
empyreus	c38c3517fd	attempting to fix az cli	2026-03-18 19:36:40 +00:00
empyreus	08092653b2	install pip systemwide	2026-03-18 19:10:56 +00:00
empyreus	b7ede93f13	move from apt-get to pip	2026-03-18 18:55:07 +00:00
empyreus	4742dfef39	fix sudo issue	2026-03-18 18:30:09 +00:00
empyreus	343c3671ef	fix sudo	2026-03-18 18:07:25 +00:00
empyreus	ffa120f6b1	rework template	2026-03-17 21:58:01 +00:00
Empyreus	8686d81de5	testing	2026-03-17 19:45:07 +00:00
Empyreus	371dfb3cc3	fix pip	2026-03-17 19:19:28 +00:00
Empyreus	431234f0a4	initial pipeline test	2026-03-17 18:45:42 +00:00
Empyreus	5f42426dc8	initial creation of test files	2026-03-16 17:47:48 +00:00
Binyang Li	bf946ea51e	Fix multicast handle leak, cuMemMap offset handling, and rename NVLS allreduce algorithms (#759)
## Summary
This PR addresses a multicast resource leak, fixes `cuMemMap` offset handling for multicast handles, renames NVLS allreduce algorithm classes for clarity, and adds a new unit test for `SwitchChannel`.
### Bug Fixes
#### 1. Fix multicast allocation handle leak in `createMulticast()` (`gpu_ipc_mem.cc`)
`GpuIpcMemHandle::createMulticast()` called `cuMulticastCreate(&allocHandle, ...)` but never released the local `allocHandle` after exporting it to shareable handles (POSIX FD / Fabric). This caused a reference count leak: the multicast object was never freed even after all mappings and imported handles were released.
Per the [CUDA Driver API docs for `cuMemRelease`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html):
> "The memory allocation will be freed when all outstanding mappings to the memory are unmapped and when all outstanding references to the handle (including its shareable counterparts) are also released."
The fix adds `cuMemRelease(allocHandle)` after export, matching the existing pattern used for regular allocations in `GpuIpcMemHandle::create()`.
Impact: Without this fix, repeated creation/destruction of NVLS connections causes OOM after ~120 iterations when allocating 1GB multicast buffers on H100.
#### 2. Fix `cuMemMap` offset for multicast handles (`gpu_ipc_mem.cc`)
`cuMemMap` requires `offset=0` for multicast handles. Previously, the code attempted to map at a non-zero offset within the multicast object, leading to errors when binding multiple buffers to the same `NvlsConnection`. The fix maps the entire range `[0, mcOffset + bufferSize)` and returns the pointer offset by `mcOffset`. This only consumes extra virtual address space; no additional physical memory is used.
### Refactoring
#### 3. Rename NVLS allreduce algorithm classes
Renamed for clarity:
- `AllreduceNvls` → `AllreduceNvlsZeroCopy`
- `AllreduceNvlsWithCopy` → `AllreduceNvlsWarpPipeline`
- `AllreduceNvlsWithCopy2` → `AllreduceNvlsBlockPipeline`
Updated all references in builder, selector, docs, and examples.
#### 4. Move `nvlsConnections` setup to `initialize()`
Moved `nvlsConnections_` from `AlgorithmCtx` (which no longer has this member) to individual algorithm class members, initialized in their `initialize()` methods.
### Tests
#### 5. Add `TwoChannelsSameConnection` test
New unit test that creates two `SwitchChannel` instances from the same `NvlsConnection`, performs reduce operations on both, and verifies correctness. This exercises the multi-bind path that triggered the `cuMemMap` offset fix.
### Files Changed
- `src/core/gpu_ipc_mem.cc`: multicast handle leak fix + cuMemMap offset fix
- `src/ext/collectives/allreduce/allreduce_nvls_zero_copy.cu` (renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_warp_pipeline.cu` (renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_block_pipeline.cu` (renamed)
- `src/ext/collectives/allreduce/allreduce_nvls_packet.cu`: nvlsConnections fix
- `src/ext/collectives/include/allreduce/*.hpp`: renamed headers
- `src/ext/collectives/algorithm_collection_builder.cc`: updated references
- `src/ext/nccl/algorithm_selector.cc`: updated algorithm names
- `test/mp_unit/switch_channel_tests.cu`: new test
- `docs/guide/mscclpp-torch-integration.md`: updated names
- `examples/torch-integration/customized_comm_with_default_algo.py`: updated names	2026-03-09 10:22:45 -07:00
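
Commit 93f6eeaa6b mentions a unified `TEST()` macro with SFINAE-based fixture auto-detection. The log does not show how the repository implements it, so the following is only a minimal sketch of one way such detection could work, not the mscclpp code. Every name here (`sketch`, `DefaultFixture`, `IsComplete`, `SKETCH_TEST`, `CounterTest`, `PlainSuite`) is hypothetical. The assumed idea: the macro forward-declares the suite name, then uses SFINAE on `sizeof` to check whether a fixture class with that name has already been defined, and falls back to a default base otherwise.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of mscclpp

// Default base used when a suite has no fixture class of its own.
struct DefaultFixture {
  virtual ~DefaultFixture() = default;
  virtual void SetUp() {}
  virtual void TearDown() {}
};

// SFINAE probe: sizeof(T) is ill-formed for an incomplete type, so the
// partial specialization is only selected when T is fully defined.
template <typename T, typename = void>
struct IsComplete : std::false_type {};
template <typename T>
struct IsComplete<T, std::enable_if_t<(sizeof(T) > 0)>> : std::true_type {};

using TestList = std::vector<std::pair<std::string, std::function<void()>>>;
inline TestList& registry() {
  static TestList tests;  // function-local static avoids init-order issues
  return tests;
}

// Registers a test at static-initialization time.
struct Registrar {
  Registrar(std::string name, std::function<void()> fn) {
    registry().emplace_back(std::move(name), std::move(fn));
  }
};

template <typename T>
void runOne() {
  T test;
  test.SetUp();
  test.TestBody();
  test.TearDown();
}

// Runs every registered test and returns how many were executed.
inline int runAll() {
  int ran = 0;
  for (auto& entry : registry()) {
    entry.second();
    ++ran;
  }
  return ran;
}

}  // namespace sketch

// One macro for both cases. `struct Suite;` is harmless if Suite is already
// defined; otherwise it leaves Suite incomplete and DefaultFixture is used.
// Assumption: a fixture must be defined before the macro and must derive
// from sketch::DefaultFixture.
#define SKETCH_TEST(Suite, Name)                                            \
  struct Suite;                                                             \
  struct Suite##_##Name##_Test                                              \
      : std::conditional_t<sketch::IsComplete<Suite>::value, Suite,         \
                           sketch::DefaultFixture> {                        \
    void TestBody();                                                        \
  };                                                                        \
  static sketch::Registrar Suite##_##Name##_registrar(                      \
      #Suite "." #Name, [] { sketch::runOne<Suite##_##Name##_Test>(); });   \
  void Suite##_##Name##_Test::TestBody()

// Hypothetical fixture: defined before use, so the macro detects it.
struct CounterTest : sketch::DefaultFixture {
  int value = 0;
  void SetUp() override { value = 7; }
};

SKETCH_TEST(CounterTest, SeesFixtureState) { assert(value == 7); }

// No class named PlainSuite exists, so the default fixture is used.
SKETCH_TEST(PlainSuite, NoFixtureNeeded) { assert(1 + 1 == 2); }
```

Calling `sketch::runAll()` from `main` would execute both tests. One caveat of this approach is that the completeness check is evaluated once per type, so a fixture defined after its first `SKETCH_TEST` use would not be picked up.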

960 Commits