mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-12 09:17:06 +00:00
Add an off-by-default CMake option MSCCLPP_EP_LL_PROFILE that compiles in
per-block clock64() timestamps in the low-latency dispatch and combine
kernels, plus host-side readback in buffer.cc that prints a min/avg/max
breakdown when MSCCLPP_EP_LL_PROFILE_PRINT=1.

Per-block slots written to workspace tail (layout [block][4] uint64_t):
  [0]=phase entry
  [1]=send done
  [2]=wait done
  [3]=kernel exit

Workspace offsets are chosen to coexist with the dispatch atomic counters
(atomic_counter_per_expert + atomic_finish_counter_per_expert occupy
[0..2*num_experts*sizeof(int)]). Dispatch prof_buf at offset 1024;
combine prof_buf at offset 65536. The earlier draft placed combine prof
at offset 256, which silently corrupted dispatch's
atomic_finish_counter_per_expert between iterations and caused iter 1+
to hang in dispatch's FINISHED_SUM_TAG spinwait.

Profile breakdown (us, 16 ranks, TOKENS=128/TOPK=8) confirms combine is
wait-dominated:
  combine   send~10  wait~50   reduce~40
  dispatch  send~30  wait~100  unpack~50  (FP8 cast in send)

OFF by default (zero overhead). Enable with:
  cmake -DMSCCLPP_EP_LL_PROFILE=ON .
  MSCCLPP_EP_LL_PROFILE_PRINT=1 ./run_ll_mpirun.sh
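
For illustration only, the host-side reduction and the offset reasoning described above might look roughly like the sketch below. It is not the code in buffer.cc: every name here (PhaseStats, reduce_phase, counter_bytes, prof_offset_is_safe) is hypothetical, and only the [block][4] uint64_t slot layout and the byte offsets are taken from this message. Values are raw clock64() ticks; converting them to microseconds requires the device clock rate, which is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper names; only the slot layout comes from the commit message.
// Slots per block: [0] phase entry, [1] send done, [2] wait done, [3] kernel exit.
constexpr std::size_t kSlotsPerBlock = 4;

struct PhaseStats {
  std::uint64_t min = 0;
  std::uint64_t max = 0;
  double avg = 0.0;
};

// Reduce one phase, slot[to] - slot[from], across all blocks.
// `slots` is the profiling buffer after it has been copied back to the host.
PhaseStats reduce_phase(const std::vector<std::uint64_t>& slots,
                        std::size_t from, std::size_t to) {
  assert(from < to && to < kSlotsPerBlock);
  assert(!slots.empty() && slots.size() % kSlotsPerBlock == 0);
  const std::size_t num_blocks = slots.size() / kSlotsPerBlock;

  PhaseStats stats;
  stats.min = UINT64_MAX;
  double sum = 0.0;
  for (std::size_t b = 0; b < num_blocks; ++b) {
    const std::uint64_t* row = &slots[b * kSlotsPerBlock];
    // Both timestamps come from the same block, so the difference is meaningful.
    const std::uint64_t dt = row[to] - row[from];
    stats.min = std::min(stats.min, dt);
    stats.max = std::max(stats.max, dt);
    sum += static_cast<double>(dt);
  }
  stats.avg = sum / static_cast<double>(num_blocks);
  return stats;
}

// Bytes taken by the two per-expert int counter arrays at the start of the workspace.
constexpr std::size_t counter_bytes(std::size_t num_experts) {
  return 2 * num_experts * sizeof(int);
}

// True when a profiling buffer starting at `prof_offset` begins past the counters.
constexpr bool prof_offset_is_safe(std::size_t num_experts, std::size_t prof_offset) {
  return prof_offset >= counter_bytes(num_experts);
}
```

Calling reduce_phase(slots, 0, 1) gives the send phase and reduce_phase(slots, 1, 2) the wait phase. Assuming a 4-byte int, the counters for 64 experts span 512 bytes, so prof_offset_is_safe(64, 256) is false while an offset of 1024 passes, which matches the overlap described for the earlier draft.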