Commit Graph

971 Commits

Author SHA1 Message Date
empyreus
827ca0935c fully remove msccl install step 2026-04-02 14:23:03 +00:00
empyreus
8e4ddc15ff remove msccl tests 2026-04-02 14:20:40 +00:00
empyreus
566cf93349 remove msccl build 2026-04-02 13:56:15 +00:00
empyreus
7948682dfa fix 2026-04-01 21:40:15 +00:00
empyreus
e4244c4466 fix clone 2026-04-01 21:19:04 +00:00
empyreus
d6dd64f463 fix copy 2026-04-01 20:35:21 +00:00
empyreus
3196758efb remove container 2026-04-01 18:29:43 +00:00
empyreus
ea8e6af959 fix missing container 2026-04-01 17:40:30 +00:00
empyreus
2080baad44 try new removal 2026-04-01 17:17:19 +00:00
empyreus
4f37637507 fixes 2026-04-01 16:09:53 +00:00
empyreus
131f128b6a comment out old docker pull 2026-04-01 15:37:28 +00:00
empyreus
d7b0dd627e trying to rework image pull 2026-04-01 15:31:32 +00:00
empyreus
f5159b0e16 check if pipeline needs creation 2026-04-01 14:45:59 +00:00
empyreus
503647c128 fix missing quote 2026-03-31 20:38:55 +00:00
empyreus
36c496dc98 re-add tests 2026-03-31 19:05:37 +00:00
empyreus
49aeea0660 fix container deletion 2026-03-31 19:05:11 +00:00
empyreus
0c8f4fd583 find directory 2026-03-31 18:55:22 +00:00
empyreus
80194b2803 fix directory 2026-03-31 16:28:35 +00:00
empyreus
48a6a2e441 add sglang all_reduce 2026-03-31 15:47:36 +00:00
empyreus
f938f60505 update sglang-test 2026-03-30 18:08:25 +00:00
empyreus
a22104c391 add remaining tests 2026-03-30 17:00:18 +00:00
empyreus
83d9301e24 full run 2026-03-27 22:20:05 +00:00
empyreus
6ac12fa1d5 comment out to fix pipeline 2026-03-27 22:16:06 +00:00
empyreus
a9d7bd8918 fix 2026-03-27 22:14:51 +00:00
empyreus
f171663d4e fix batch size 2026-03-27 21:57:07 +00:00
empyreus
38552a6f9c fix remote run and clean up files 2026-03-27 21:13:07 +00:00
empyreus
324254d57c finish adding sglang steps 2026-03-27 20:28:45 +00:00
empyreus
4107fa9644 fix run remote 2026-03-27 20:05:17 +00:00
empyreus
35148991e8 fix cmake 2026-03-27 19:32:47 +00:00
empyreus
fa30289415 update for new remote run 2026-03-26 23:51:07 +00:00
empyreus
0a6d329bb8 add sshke 2026-03-26 23:37:14 +00:00
empyreus
e423ca8952 rename files 2026-03-26 23:34:42 +00:00
empyreus
8d642637e5 update template 2026-03-26 22:42:26 +00:00
empyreus
c541f27fb6 update template 2026-03-26 22:41:02 +00:00
empyreus
99f2faced2 fixes 2026-03-26 22:32:09 +00:00
empyreus
01b7af9733 sanity check 2026-03-26 22:23:11 +00:00
empyreus
57eedc915a Merge branch 'main' into rjsouza/sglang-tests 2026-03-26 22:21:42 +00:00
empyreus
7f016cb7f0 fix venv 2026-03-26 22:01:48 +00:00
empyreus
257735cb51 setup sglang python venv 2026-03-26 21:43:56 +00:00
empyreus
8b007892df check hostname 2026-03-26 21:31:22 +00:00
empyreus
4cf9cd721b change cuda version 2026-03-26 20:50:32 +00:00
empyreus
12d935ff81 fix docker name 2026-03-26 20:22:51 +00:00
empyreus
d33458db8a modify to set up sglang 2026-03-26 19:23:24 +00:00
empyreus
c883994556 add git clone msccl 2026-03-26 18:21:44 +00:00
empyreus
dd3c3ed7cb change image 2026-03-25 22:05:18 +00:00
empyreus
ee771ec4c0 fix dockerfile 2026-03-25 21:52:09 +00:00
empyreus
fa24653d8d update docker image 2026-03-25 21:45:29 +00:00
Copilot
93f6eeaa6b Remove GTest dependency, add code coverage, and refactor unit tests and CI pipelines (#744)
- Removes the GTest dependency, replacing it with a minimal custom
framework (`test/framework.*`) that covers only what the tests actually
use — a unified `TEST()` macro with SFINAE-based fixture auto-detection,
`EXPECT_*`/`ASSERT_*` assertions, environments, and setup/teardown.
- Adds an `--exclude-perf-tests` flag and substring-based negative filtering
- Adds a `MSCCLPP_ENABLE_COVERAGE` CMake option with gcov/lcov; CI uploads to
Codecov
- Merges standalone `test/perf/` into main test targets
- Refactors Azure pipelines to reduce redundancy and improve readability
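
A unified test macro with SFINAE-based fixture detection could look roughly like the sketch below. This is an illustration only, not the contents of `test/framework.*`: the `sketch` namespace, `SKETCH_TEST`, `IsComplete`, `NoFixture`, and the two example suites are all hypothetical names. It assumes C++17 (`std::void_t`) and fixtures with public `SetUp()`/`TearDown()`. The macro forward-declares the suite name, and a SFINAE check on `sizeof` decides whether a complete fixture class with that name exists.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <type_traits>
#include <vector>

namespace sketch {  // hypothetical namespace, not the real framework

// Fallback base used when the suite name is not a defined fixture class.
struct NoFixture {
  void SetUp() {}
  void TearDown() {}
};

// SFINAE: sizeof(T) is only well-formed for complete types.
template <class T, class = void>
struct IsComplete : std::false_type {};
template <class T>
struct IsComplete<T, std::void_t<decltype(sizeof(T))>> : std::true_type {};

template <class T>
using FixtureBase = std::conditional_t<IsComplete<T>::value, T, NoFixture>;

struct TestCase {
  std::string name;
  std::function<void()> run;
};

inline std::vector<TestCase>& registry() {
  static std::vector<TestCase> r;
  return r;
}

template <class T>
bool registerTest(const char* name) {
  registry().push_back({name, [] { T t; t.SetUp(); t.Body(); t.TearDown(); }});
  return true;
}

// Runs every registered test and returns how many were run.
inline std::size_t runAll() {
  for (auto& t : registry()) t.run();
  return registry().size();
}

}  // namespace sketch

// One macro for both cases: inherits from Suite if it is a complete class,
// otherwise from NoFixture.
#define SKETCH_TEST(Suite, Name)                                      \
  class Suite;                                                         \
  class Suite##_##Name##_Test : public sketch::FixtureBase<Suite> {   \
   public:                                                             \
    void Body();                                                       \
  };                                                                   \
  static const bool Suite##_##Name##_registered =                      \
      sketch::registerTest<Suite##_##Name##_Test>(#Suite "." #Name);   \
  void Suite##_##Name##_Test::Body()

// Hypothetical example suites.
struct CounterFixture {
  int value = 0;
  void SetUp() { value = 41; }
  void TearDown() {}
};

static int observed = 0;
static bool plainRan = false;

SKETCH_TEST(CounterFixture, SeesSetUp) { observed = value + 1; }
SKETCH_TEST(PlainSuite, RunsWithoutFixture) { plainRan = true; }
```

Calling `sketch::runAll()` from `main` runs both tests. The first reads `value` from the fixture it inherited, while the second has no fixture and falls back to `NoFixture`.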

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2026-03-24 23:34:38 -04:00
empyreus
b7adec0e60 create sglang docker image 2026-03-19 20:03:00 +00:00
Binyang Li
5d18835417 Fix use-after-free for fabric allocation handle in GpuIpcMemHandle (#764)
## Summary

Fix a use-after-free where the CUDA allocation handle
(`CUmemGenericAllocationHandle`) was released prematurely while the
exported fabric handle still referenced it.

## Problem

Unlike POSIX FD handles (where the kernel keeps the allocation alive via
the open file descriptor), fabric handles do not hold their own
reference to the underlying allocation. The original code called
`cuMemRelease(allocHandle)` immediately after exporting the fabric
handle, freeing the allocation. When a remote process later tries to
`cuMemImportFromShareableHandle` using that fabric handle, it references
a freed allocation — a **use-after-free**.

This affected both code paths:

1. **`GpuIpcMemHandle::create()`**: The local `allocHandle` obtained via
`cuMemRetainAllocationHandle` was released right after fabric export,
leaving the fabric handle dangling.
2. **`GpuIpcMemHandle::createMulticast()`**: The `allocHandle` from
`cuMulticastCreate` was unconditionally released, even when it was the
only thing keeping the multicast object alive for the fabric handle.

## Fix

- **Added `allocHandle` field** to the `fabric` struct in
`GpuIpcMemHandle` to store the allocation handle and keep it alive for
the lifetime of the `GpuIpcMemHandle`.
- **`create()`**: Retain an additional reference via
`cuMemRetainAllocationHandle` and store it in `fabric.allocHandle` when
a fabric handle is successfully exported.
- **`createMulticast()`**: Store the `allocHandle` directly in
`fabric.allocHandle` instead of unconditionally releasing it. Only
release if fabric export was not used.
- **`deleter()`**: Release `fabric.allocHandle` via `cuMemRelease` when
the handle type includes `Fabric`, ensuring proper cleanup.
- **`GpuIpcMem` constructor (importer side)**: Clear
`fabric.allocHandle` after importing, since the importer gets its own
handle via `cuMemImportFromShareableHandle` and should not release the
exporter's allocation handle.
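
The ownership rule behind this fix can be sketched without a GPU. The code below is a mock, not the CUDA driver API and not the code in `src/core/gpu_ipc_mem.cc`: `MockDriver`, `MockIpcHandle`, `createBuggy`, `createFixed`, and `destroy` are hypothetical names, and the driver is modeled as a plain reference-count table. The one assumption carried over from the description above is that exporting a fabric handle does not take a reference on the allocation.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Mock allocation handle; 0 means "no handle held".
using MockAllocHandle = std::uint64_t;

// Mock driver: tracks a reference count per allocation.
class MockDriver {
 public:
  MockAllocHandle create() {
    MockAllocHandle h = next_++;
    refs_[h] = 1;
    return h;
  }
  void release(MockAllocHandle h) {
    auto it = refs_.find(h);
    if (it != refs_.end() && --it->second == 0) refs_.erase(it);
  }
  // Assumption from the text: exporting takes no reference.
  MockAllocHandle exportFabric(MockAllocHandle h) const { return h; }
  bool alive(MockAllocHandle h) const { return refs_.count(h) != 0; }

 private:
  std::unordered_map<MockAllocHandle, int> refs_;
  MockAllocHandle next_ = 1;
};

struct MockIpcHandle {
  MockAllocHandle fabricHandle = 0;
  MockAllocHandle allocHandle = 0;  // kept alive for the handle's lifetime
};

// Old behavior: release right after export, so the allocation is freed.
MockIpcHandle createBuggy(MockDriver& drv) {
  MockAllocHandle h = drv.create();
  MockIpcHandle out;
  out.fabricHandle = drv.exportFabric(h);
  drv.release(h);  // drops the only reference
  return out;
}

// Fixed behavior: store the allocation handle instead of releasing it.
MockIpcHandle createFixed(MockDriver& drv) {
  MockAllocHandle h = drv.create();
  MockIpcHandle out;
  out.fabricHandle = drv.exportFabric(h);
  out.allocHandle = h;
  return out;
}

// Deleter: release the stored allocation handle exactly once.
void destroy(MockDriver& drv, MockIpcHandle& h) {
  if (h.allocHandle != 0) {
    drv.release(h.allocHandle);
    h.allocHandle = 0;
  }
}

// A remote import can only succeed while the allocation is still alive.
bool canImport(const MockDriver& drv, const MockIpcHandle& h) {
  return drv.alive(h.fabricHandle);
}

// Walks through both paths and checks the lifetimes.
void checkLifetimes() {
  MockDriver drv;
  MockIpcHandle bad = createBuggy(drv);
  assert(!canImport(drv, bad));  // dangling fabric handle
  MockIpcHandle good = createFixed(drv);
  assert(canImport(drv, good));  // allocation kept alive
  destroy(drv, good);
  assert(!canImport(drv, good));  // released only by the deleter
}
```

In the mock, the importer-side rule corresponds to setting `allocHandle` to `0` on an imported copy, so that `destroy` never releases the exporter's reference.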

## Files Changed

- `src/core/include/gpu_ipc_mem.hpp` — Added
`CUmemGenericAllocationHandle allocHandle` to fabric struct.
- `src/core/gpu_ipc_mem.cc` — Retain/release allocation handle properly
across create, createMulticast, deleter, and importer paths.
2026-03-19 11:52:09 -07:00