mscclpp

Mirror of https://github.com/microsoft/mscclpp.git (synced 2026-06-29 19:07:30 +00:00)

Files

Qinghua Zhou fdf7d579dc ext/ep: optional preallocated outputs for low_latency_dispatch

Add optional out_packed_recv_x / out_src_info / out_layout_range /
out_count parameters to Buffer::low_latency_dispatch so callers can
hoist the four recv-side allocations out of a hot loop, mirroring the
existing out= path on low_latency_combine.

The bench in test_low_latency_multirank.py preallocates these tensors
once and passes them on every iter so the timed loop reflects kernel
cost, not torch.empty + caching-allocator overhead.

2026-04-30 18:45:44 +00:00
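The optional-output pattern described in the commit message above can be sketched in plain Python. This is an illustration only, not the mscclpp API: `toy_dispatch`, `num_slots`, `out_recv`, and the use of `array` in place of torch tensors are all hypothetical, and the sketch models two outputs where the commit names four. The real signature of `Buffer::low_latency_dispatch` is not shown on this page.

```python
from array import array


def toy_dispatch(tokens, num_slots, out_recv=None, out_count=None):
    """Hypothetical stand-in for a dispatch call with optional outputs.

    If the caller passes no output buffers, fresh ones are allocated on
    every call. If buffers are passed, they are written in place, so a
    hot loop pays for the allocation only once.
    """
    if out_recv is None:
        out_recv = array("d", [0.0]) * num_slots  # per-call allocation
    if out_count is None:
        out_count = array("i", [0])               # per-call allocation
    if len(out_recv) != num_slots:
        raise ValueError("out_recv must have exactly num_slots entries")

    # Copy as many tokens as fit and record how many were written.
    n = min(len(tokens), num_slots)
    for i in range(n):
        out_recv[i] = tokens[i]
    out_count[0] = n
    return out_recv, out_count


# Hot loop: allocate once outside the loop, reuse on every iteration.
recv = array("d", [0.0]) * 8
count = array("i", [0])
for step in range(3):
    r, c = toy_dispatch([float(step)] * 4, 8, out_recv=recv, out_count=count)
    assert r is recv and c is count  # no new buffers were created
```

In this sketch a reused buffer keeps stale values past the first `out_count[0]` entries, so callers should read only the counted prefix.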

| File | Last commit | Date |
|---|---|---|
| test_ep_smoke.py | src/ext/ep: port low-latency dispatch/combine kernels | 2026-04-20 21:46:00 +00:00 |
| test_internode_multirank.py | tests/ep: add NCCL-EP six-metric BW breakdown (send/recv x total/nvl/rdma) | 2026-04-29 20:44:10 +00:00 |
| test_intranode_multirank.py | tests/ep: intranode send-side counts unique (token, dst_node) to match NCCL-EP | 2026-04-29 23:31:47 +00:00 |
| test_low_latency_multirank.py | ext/ep: optional preallocated outputs for low_latency_dispatch | 2026-04-30 18:45:44 +00:00 |