mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-11 17:00:22 +00:00
Add optional out_packed_recv_x / out_src_info / out_layout_range / out_count parameters to Buffer::low_latency_dispatch so callers can hoist the four recv-side allocations out of a hot loop, mirroring the existing out= path on low_latency_combine. The bench in test_low_latency_multirank.py preallocates these tensors once and passes them on every iter so the timed loop reflects kernel cost, not torch.empty + caching-allocator overhead.