mscclpp/src
Qinghua Zhou fec40601b8 ext/ep: add opt-in in-kernel timestamp profiling for LL dispatch/combine
Add an off-by-default CMake option MSCCLPP_EP_LL_PROFILE that compiles
in per-block clock64() timestamps in the low-latency dispatch and
combine kernels, plus host-side readback in buffer.cc that prints a
min/avg/max breakdown when MSCCLPP_EP_LL_PROFILE_PRINT=1.

Per-block slots are written to the workspace tail (layout [block][4] uint64_t):
  [0]=phase entry  [1]=send done  [2]=wait done  [3]=kernel exit
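
A minimal host-side sketch of how the [block][4] layout above could be reduced to a min/avg/max breakdown. It is an illustration, not the readback code in buffer.cc: `Slot`, `Stats`, `summarize`, and the `ticks_per_us` conversion factor are hypothetical names, and the buffer is assumed to have already been copied to the host. Deltas are taken only within one block's record, because clock64() reads a per-multiprocessor counter and values from different blocks are not directly comparable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Slot indices inside one block's record, matching the [block][4] layout.
enum Slot : std::size_t { kEntry = 0, kSendDone = 1, kWaitDone = 2, kExit = 3 };
constexpr std::size_t kSlotsPerBlock = 4;

struct Stats {
  double min;
  double avg;
  double max;
};

// Reduce the (to - from) delta over all blocks, converted to microseconds.
// ticks_per_us is an assumed, caller-supplied conversion factor.
Stats summarize(const std::vector<std::uint64_t>& buf, std::size_t num_blocks,
                Slot from, Slot to, double ticks_per_us) {
  assert(num_blocks > 0 && buf.size() >= num_blocks * kSlotsPerBlock);
  assert(from < to && ticks_per_us > 0.0);
  double lo = 0.0, hi = 0.0, sum = 0.0;
  for (std::size_t b = 0; b < num_blocks; ++b) {
    const std::uint64_t* rec = &buf[b * kSlotsPerBlock];
    const double us = static_cast<double>(rec[to] - rec[from]) / ticks_per_us;
    lo = (b == 0) ? us : std::min(lo, us);
    hi = (b == 0) ? us : std::max(hi, us);
    sum += us;
  }
  return {lo, sum / static_cast<double>(num_blocks), hi};
}
```

Under this reading, the send phase would be `summarize(buf, n, kEntry, kSendDone, f)`, the wait phase `summarize(buf, n, kSendDone, kWaitDone, f)`, and the final phase (reduce or unpack) `summarize(buf, n, kWaitDone, kExit, f)`.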

Workspace offsets are chosen to coexist with the dispatch atomic
counters (atomic_counter_per_expert and atomic_finish_counter_per_expert
occupy bytes [0, 2*num_experts*sizeof(int))). The dispatch prof_buf sits
at offset 1024 and the combine prof_buf at offset 65536. An earlier draft
placed the combine prof_buf at offset 256, which silently corrupted
dispatch's atomic_finish_counter_per_expert between iterations and caused
iterations 1 and later to hang in dispatch's FINISHED_SUM_TAG spin-wait.
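
The overlap described above can be checked with simple interval arithmetic. The following is a sketch under stated assumptions, not code from this commit: `Region`, `overlaps`, `counters`, `finish_counters`, `prof_buf`, and `layout_ok` are hypothetical helpers, the finish counters are assumed to follow the per-expert counters directly, and the num_experts and num_blocks values in the usage note are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Half-open byte range [begin, end) inside the workspace.
struct Region {
  std::size_t begin;
  std::size_t end;
};

constexpr bool overlaps(Region a, Region b) {
  return a.begin < b.end && b.begin < a.end;
}

// Each block writes 4 uint64_t timestamps.
constexpr std::size_t kSlotBytes = 4 * sizeof(std::uint64_t);

// Both dispatch counter arrays together: [0, 2*num_experts*sizeof(int)).
constexpr Region counters(std::size_t num_experts) {
  return {0, 2 * num_experts * sizeof(int)};
}

// Assumed position of atomic_finish_counter_per_expert (second array).
constexpr Region finish_counters(std::size_t num_experts) {
  return {num_experts * sizeof(int), 2 * num_experts * sizeof(int)};
}

constexpr Region prof_buf(std::size_t offset, std::size_t num_blocks) {
  return {offset, offset + num_blocks * kSlotBytes};
}

// True when neither profiling buffer touches the counters or the other buffer.
constexpr bool layout_ok(std::size_t num_experts, std::size_t num_blocks,
                         std::size_t dispatch_off, std::size_t combine_off) {
  return !overlaps(counters(num_experts), prof_buf(dispatch_off, num_blocks)) &&
         !overlaps(counters(num_experts), prof_buf(combine_off, num_blocks)) &&
         !overlaps(prof_buf(dispatch_off, num_blocks),
                   prof_buf(combine_off, num_blocks));
}
```

With an illustrative 64 experts and 32 blocks, `layout_ok(64, 32, 1024, 65536)` holds, while `layout_ok(64, 32, 1024, 256)` fails because a buffer starting at byte 256 lands inside the assumed finish-counter range. Such a check could also be written as a `static_assert` if the sizes are compile-time constants.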

Profile breakdown (us, 16 ranks, TOKENS=128/TOPK=8) confirms combine
is wait-dominated:
  combine   send~10  wait~50   reduce~40
  dispatch  send~30  wait~100  unpack~50  (FP8 cast in send)

OFF by default (zero overhead). Enable with:
  cmake -DMSCCLPP_EP_LL_PROFILE=ON .
  MSCCLPP_EP_LL_PROFILE_PRINT=1 ./run_ll_mpirun.sh
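
A rough sketch of how the two switches could combine on the host side: the build option decides whether profiling code exists at all, and the environment variable decides whether the breakdown is printed. This assumes the CMake option is forwarded as a preprocessor definition of the same name, which the commit message does not state; `flag_is_one` and `ll_profile_print_enabled` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// True only when the value is exactly the string "1".
inline bool flag_is_one(const char* value) {
  return value != nullptr && std::strcmp(value, "1") == 0;
}

// Printing needs both the compile-time option and the runtime variable.
// MSCCLPP_EP_LL_PROFILE as a macro name is an assumption (see lead-in).
inline bool ll_profile_print_enabled() {
#ifdef MSCCLPP_EP_LL_PROFILE
  return flag_is_one(std::getenv("MSCCLPP_EP_LL_PROFILE_PRINT"));
#else
  return false;  // profiling compiled out: nothing to print
#endif
}
```

Keeping the compile-time guard outermost is what makes the default build free of overhead: without the definition, no timestamp code or environment lookup is compiled in.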
2026-05-07 21:35:45 +00:00