mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-06-29 10:57:27 +00:00

Qinghua Zhou fdf7d579dc ext/ep: optional preallocated outputs for low_latency_dispatch

Add optional out_packed_recv_x / out_src_info / out_layout_range /
out_count parameters to Buffer::low_latency_dispatch so callers can
hoist the four recv-side allocations out of a hot loop, mirroring the
existing out= path on low_latency_combine.

The benchmark in test_low_latency_multirank.py preallocates these tensors
once and passes them on every iteration, so the timed loop reflects kernel
cost rather than torch.empty and caching-allocator overhead.
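
The optional-output pattern described above can be sketched in plain Python.
This is a minimal illustration only, not the mscclpp implementation:
`fake_dispatch`, `alloc_recv`, and `ALLOCATIONS` are hypothetical stand-ins,
a stdlib `array` stands in for a tensor, and only one of the four outputs
(`out_packed_recv_x`) is modeled.

```python
from array import array

ALLOCATIONS = 0  # counts how often the stand-in allocates a receive buffer


def alloc_recv(n):
    """Hypothetical stand-in for torch.empty: allocate an n-element float buffer."""
    global ALLOCATIONS
    ALLOCATIONS += 1
    return array("f", [0.0]) * n


def fake_dispatch(x, out_packed_recv_x=None):
    """Stand-in for a dispatch call with an optional preallocated output.

    Assumption: the real call fills the receive buffer on the GPU; here we
    simply copy x into it so the control flow can be run anywhere.
    """
    if out_packed_recv_x is None:
        # No buffer supplied: allocate one on every call.
        out_packed_recv_x = alloc_recv(len(x))
    elif len(out_packed_recv_x) != len(x):
        raise ValueError("preallocated output has the wrong size")
    for i, value in enumerate(x):
        out_packed_recv_x[i] = value
    return out_packed_recv_x


x = array("f", [1.0, 2.0, 3.0])

# Without a preallocated output, each iteration allocates a fresh buffer.
for _ in range(5):
    fake_dispatch(x)
per_call = ALLOCATIONS

# With a preallocated output, the allocation is hoisted out of the loop.
ALLOCATIONS = 0
recv = alloc_recv(len(x))
for _ in range(5):
    result = fake_dispatch(x, out_packed_recv_x=recv)
hoisted = ALLOCATIONS
```

In a benchmark, the hoisted form keeps allocation outside the timed region,
so the measured time covers only the work done inside the call.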

2026-04-30 18:45:44 +00:00

ext/ep
