mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-07-18 17:57:25 +00:00

Qinghua Zhou 1600074f09 tests/ep: hoist combine output tensor out of the timed loop

The LL combine benchmark was cloning the ~58 MB dispatch recv buffer
('recv_x.clone()') on every timed iteration, adding ~20 us of D2D
memcpy per sample and masking kernel-level changes. It also called
torch.empty() for the output inside the loop. Both now live outside
the timed region; the kernel is invoked against a persistent bench_out
and the recv_x produced by the most recent dispatch.
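The pattern described above can be illustrated with a minimal CPU-only sketch. It is not the code from tests/ep: `fake_combine`, `bench_alloc_inside`, and `bench_hoisted` are hypothetical names, `bytearray` stands in for torch tensors, and `time.perf_counter` stands in for whatever GPU timing the real benchmark uses. Only `recv_x` and `bench_out` are names taken from the commit message.

```python
import time


def fake_combine(recv_x, out):
    """Hypothetical stand-in for the combine kernel: copies recv_x into out."""
    out[:] = recv_x


def bench_alloc_inside(recv_x, iters):
    """Anti-pattern: copy the input and allocate the output in every timed sample."""
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        x = bytearray(recv_x)          # analogous to recv_x.clone()
        out = bytearray(len(recv_x))   # analogous to torch.empty(...)
        fake_combine(x, out)
        samples.append(time.perf_counter() - t0)
    return samples


def bench_hoisted(recv_x, iters):
    """Hoisted version: the output buffer is allocated once, before timing starts."""
    bench_out = bytearray(len(recv_x))  # persistent output, reused by every sample
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fake_combine(recv_x, bench_out)  # only the "kernel" sits in the timed region
        samples.append(time.perf_counter() - t0)
    return samples, bench_out
```

In the hoisted version each sample measures only the call under test, so a change to the kernel is not hidden behind a fixed copy and allocation cost. A real GPU benchmark would also need to synchronize the device (or use device-side timers) around the timed region, which this sketch does not model.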

2026-04-24 21:06:49 +00:00

ext/ep    tests/ep: hoist combine output tensor out of the timed loop    2026-04-24 21:06:49 +00:00