mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 09:17:06 +00:00

Files

Qinghua Zhou fec40601b8 ext/ep: add opt-in in-kernel timestamp profiling for LL dispatch/combine

Add an off-by-default CMake option MSCCLPP_EP_LL_PROFILE that compiles
in per-block clock64() timestamps in the low-latency dispatch and
combine kernels, plus host-side readback in buffer.cc that prints a
min/avg/max breakdown when MSCCLPP_EP_LL_PROFILE_PRINT=1.

Per-block slots written to workspace tail (layout [block][4] uint64_t):
  [0]=phase entry  [1]=send done  [2]=wait done  [3]=kernel exit
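
The min/avg/max readback over this layout could be sketched on the host as follows. This is a minimal illustration, not the code in buffer.cc: the names `Slot`, `PhaseStats` and `phaseStats` are hypothetical, and the buffer is assumed to have already been copied from device memory into a host vector. Values are raw clock64() ticks. Converting them to microseconds needs the device clock rate, which is left out here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Slot indices, following the [block][4] layout described above.
enum Slot : std::size_t { kEntry = 0, kSendDone = 1, kWaitDone = 2, kExit = 3 };
constexpr std::size_t kSlotsPerBlock = 4;

// Hypothetical result type: statistics over all blocks, in clock ticks.
struct PhaseStats {
  double min;
  double avg;
  double max;
};

// Reduce the elapsed ticks between two slots across all blocks.
// Differences are taken within a single block only, because clock64()
// is a per-SM counter and is not comparable across blocks.
PhaseStats phaseStats(const std::vector<std::uint64_t>& buf,
                      std::size_t numBlocks, Slot from, Slot to) {
  assert(numBlocks > 0 && buf.size() >= numBlocks * kSlotsPerBlock);
  double lo = 0.0, hi = 0.0, sum = 0.0;
  for (std::size_t b = 0; b < numBlocks; ++b) {
    const std::uint64_t* slots = &buf[b * kSlotsPerBlock];
    const double ticks = static_cast<double>(slots[to] - slots[from]);
    lo = (b == 0) ? ticks : std::min(lo, ticks);
    hi = (b == 0) ? ticks : std::max(hi, ticks);
    sum += ticks;
  }
  return PhaseStats{lo, sum / static_cast<double>(numBlocks), hi};
}
```

With this shape, the send phase is `phaseStats(buf, n, kEntry, kSendDone)` and the wait phase is `phaseStats(buf, n, kSendDone, kWaitDone)`.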

Workspace offsets are chosen to coexist with the dispatch atomic
counters (atomic_counter_per_expert + atomic_finish_counter_per_expert
occupy bytes [0, 2*num_experts*sizeof(int))). Dispatch prof_buf is at
offset 1024; combine prof_buf is at offset 65536. An earlier draft
placed combine prof_buf at offset 256, which silently corrupted
dispatch's atomic_finish_counter_per_expert between iterations and
caused iterations 1+ to hang in dispatch's FINISHED_SUM_TAG spinwait.
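
The overlap described above can be checked with plain interval arithmetic. The sketch below is an assumption-laden illustration rather than code from this commit: the helper names (`counterBytes`, `profBytes`, `overlaps`, `layoutIsSafe`) are hypothetical, and the expert and block counts passed in are example values, not figures taken from the kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes used by the two per-expert int counter arrays at the start of
// the workspace: [0, 2 * num_experts * sizeof(int)).
constexpr std::size_t counterBytes(std::size_t numExperts) {
  return 2 * numExperts * sizeof(int);
}

// Bytes used by one profiling buffer laid out as [block][4] uint64_t.
constexpr std::size_t kProfSlots = 4;
constexpr std::size_t profBytes(std::size_t numBlocks) {
  return numBlocks * kProfSlots * sizeof(std::uint64_t);
}

// True if the half-open byte ranges [a, a+alen) and [b, b+blen) intersect.
constexpr bool overlaps(std::size_t a, std::size_t alen,
                        std::size_t b, std::size_t blen) {
  return a < b + blen && b < a + alen;
}

// Check that the counters and both profiling buffers are pairwise disjoint.
// The default offsets are the ones stated in the commit message.
constexpr bool layoutIsSafe(std::size_t numExperts,
                            std::size_t dispatchBlocks,
                            std::size_t combineBlocks,
                            std::size_t dispatchOff = 1024,
                            std::size_t combineOff = 65536) {
  return !overlaps(0, counterBytes(numExperts),
                   dispatchOff, profBytes(dispatchBlocks)) &&
         !overlaps(0, counterBytes(numExperts),
                   combineOff, profBytes(combineBlocks)) &&
         !overlaps(dispatchOff, profBytes(dispatchBlocks),
                   combineOff, profBytes(combineBlocks));
}
```

Assuming a 4-byte int and, hypothetically, 64 experts, the counters span bytes [0, 512), so a combine buffer at offset 256 lands inside the second counter array, while offsets 1024 and 65536 stay clear. Under the same assumption, offset 1024 stops being safe once the counters exceed 1024 bytes, that is, above 128 experts.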

Profile breakdown (us, 16 ranks, TOKENS=128/TOPK=8) confirms combine
is wait-dominated:
  combine  send~10  wait~50  reduce~40
  dispatch send~30  wait~100 unpack~50 (FP8 cast in send)

OFF by default (zero overhead). Enable with:
  cmake -DMSCCLPP_EP_LL_PROFILE=ON .
  MSCCLPP_EP_LL_PROFILE_PRINT=1 ./run_ll_mpirun.sh

2026-05-07 21:35:45 +00:00

core

ext/ep: GPU-initiated IBGDA path for low-latency dispatch/combine

2026-05-07 05:14:15 +00:00

ext

ext/ep: add opt-in in-kernel timestamp profiling for LL dispatch/combine

2026-05-07 21:35:45 +00:00

.gitignore

[python] switch to setup.py to build package

2023-04-12 12:29:17 -07:00

CMakeLists.txt

Torch integration (#692)

2026-01-21 20:32:24 -08:00