test/ext/ep: intranode HT — parameterize num_sms and NVL chunk sizes

Adds MSCCLPP_EP_NUM_SMS / MSCCLPP_EP_NVL_SEND / MSCCLPP_EP_NVL_RECV env
overrides for ep.Config(num_sms, num_max_nvl_chunked_send_tokens,
num_max_nvl_chunked_recv_tokens). Defaults unchanged (20, 8, 256).

Sweep on 4-rank intranode HT (tokens=4096, hidden=7168, experts=256):
  sms=20, NVL_SEND=8,  NVL_RECV=256 -> d_recv=50.76, c_recv=65.66 GB/s
  sms=64, NVL_SEND=16, NVL_RECV=512 -> d_recv=57.75, c_recv=150.46 GB/s

d_recv (actual NVL throughput per rank) plateaus at ~57 GB/s for topk>=2;
combine recv scales near-linearly with num_sms.
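
As a quick check on the two sweep rows above, the scaling ratios can be computed directly. This is only arithmetic on the numbers quoted in this message; the dictionary keys and variable names are illustrative. Both num_sms and the NVL chunk sizes change between the two rows, so the ratios do not isolate the effect of num_sms alone.

```python
# Sweep rows copied from the commit message (GB/s per rank).
baseline = {"sms": 20, "d_recv": 50.76, "c_recv": 65.66}
tuned = {"sms": 64, "d_recv": 57.75, "c_recv": 150.46}

# Ratio of tuned to baseline for each quantity.
sms_ratio = tuned["sms"] / baseline["sms"]
d_recv_ratio = tuned["d_recv"] / baseline["d_recv"]
c_recv_ratio = tuned["c_recv"] / baseline["c_recv"]

print(f"num_sms x{sms_ratio:.2f}")
print(f"d_recv  x{d_recv_ratio:.2f}")
print(f"c_recv  x{c_recv_ratio:.2f}")
```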

Author: qinghuazhou
Date: 2026-05-13 16:25:01 +00:00
Parent: f9f0d0fcb7
Commit: 33e59c2908

@@ -106,7 +106,9 @@ def main():
# Allocate Buffer (intranode only: num_rdma_bytes=0). Size the NVL buffer
# using max(hidden, bench_hidden) so the optional bench phase fits.
-cfg = ep.Config(20, 8, 256)
+cfg = ep.Config(int(os.environ.get("MSCCLPP_EP_NUM_SMS", "20")),
+                int(os.environ.get("MSCCLPP_EP_NVL_SEND", "8")),
+                int(os.environ.get("MSCCLPP_EP_NVL_RECV", "256")))
_bench_on = os.environ.get("MSCCLPP_EP_BENCH", "0") == "1"
_buf_hidden = max(hidden, int(os.environ.get("MSCCLPP_EP_BENCH_HIDDEN", "0"))) if _bench_on else hidden
num_nvl_bytes = cfg.get_nvl_buffer_size_hint(_buf_hidden * x.element_size(), num_ranks)
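
A minimal sketch of how the three overrides could be read with validation, in case a malformed value should fail with a clearer message than a bare int() error. The helper names (read_positive_int, read_ep_config_args) are hypothetical and not part of this commit. Unlike the diff above, this sketch treats an empty variable as unset and rejects non-positive values; both are assumptions about what is desirable, not behavior of the test.

```python
import os

# Defaults from the commit message: num_sms, NVL send chunk, NVL recv chunk.
DEFAULTS = {
    "MSCCLPP_EP_NUM_SMS": 20,
    "MSCCLPP_EP_NVL_SEND": 8,
    "MSCCLPP_EP_NVL_RECV": 256,
}


def read_positive_int(name, default, env=None):
    """Return env[name] as a positive int, or default if unset or empty."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name}={raw!r} is not an integer") from None
    if value <= 0:
        raise ValueError(f"{name}={value} must be positive")
    return value


def read_ep_config_args(env=None):
    """Return (num_sms, nvl_send, nvl_recv) in the order ep.Config expects."""
    return tuple(read_positive_int(n, d, env) for n, d in DEFAULTS.items())


if __name__ == "__main__":
    # The tuple would be passed on as ep.Config(*read_ep_config_args()).
    print(read_ep_config_args())
```

With none of the variables set, read_ep_config_args() yields the unchanged defaults; setting all three to 64, 16 and 512 reproduces the second sweep row.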