mscclpp/test/python
Qinghua Zhou 1600074f09 tests/ep: hoist combine output tensor out of the timed loop
The LL combine benchmark was cloning the ~58 MB dispatch recv buffer
('recv_x.clone()') on every timed iteration, adding ~20 us of D2D
memcpy per sample and masking kernel-level changes. It also called
torch.empty() for the output inside the loop. Both now live outside
the timed region; the kernel is invoked against a persistent bench_out
and the recv_x produced by the most recent dispatch.
2026-04-24 21:06:49 +00:00
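
The change described in the commit message above can be illustrated with a minimal sketch. This is not the actual test code from mscclpp/test/python: `fake_combine`, `bench_clone_inside`, and `bench_hoisted` are hypothetical names, plain `bytearray` buffers stand in for the CUDA tensors, and `time.perf_counter()` stands in for whatever timing the real benchmark uses. It only shows the pattern of moving the copy and the output allocation out of the timed region.

```python
import time


def fake_combine(recv, out):
    # Hypothetical stand-in for the LL combine kernel: reads from `recv`
    # and writes a result into the caller-provided `out` buffer.
    out[0] = recv[0]
    return out


def bench_clone_inside(recv, iters):
    # Old pattern: the copy and the output allocation are both timed,
    # so every sample includes their cost on top of the kernel.
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        src = bytearray(recv)        # analogous to recv_x.clone()
        out = bytearray(len(recv))   # analogous to torch.empty() in the loop
        fake_combine(src, out)
        samples.append(time.perf_counter() - t0)
    return samples


def bench_hoisted(recv, iters):
    # New pattern: allocate one persistent output before the loop and
    # pass the existing receive buffer directly, so only the kernel is timed.
    bench_out = bytearray(len(recv))
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fake_combine(recv, bench_out)
        samples.append(time.perf_counter() - t0)
    return samples, bench_out


if __name__ == "__main__":
    recv = bytearray(1 << 20)  # small placeholder, not the ~58 MB buffer
    inside = bench_clone_inside(recv, 20)
    hoisted, _ = bench_hoisted(recv, 20)
    print(f"clone inside loop: {min(inside) * 1e6:.1f} us (best sample)")
    print(f"hoisted:           {min(hoisted) * 1e6:.1f} us (best sample)")
```

Reusing one output buffer is safe here only because the benchmark does not inspect the result of each iteration; a correctness test would still need a fresh or verified output.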