Mirror of https://github.com/microsoft/mscclpp.git (synced 2026-05-25 23:34:49 +00:00)
test/ext/ep: intranode HT — parameterize num_sms and NVL chunk sizes
Adds MSCCLPP_EP_NUM_SMS / MSCCLPP_EP_NVL_SEND / MSCCLPP_EP_NVL_RECV env
overrides for ep.Config(num_sms, num_max_nvl_chunked_send_tokens,
num_max_nvl_chunked_recv_tokens). Defaults unchanged (20, 8, 256).

Sweep on 4-rank intranode HT (tokens=4096, hidden=7168, experts=256):
  sms=20, NVL_SEND=8,  NVL_RECV=256 -> d_recv=50.76, c_recv=65.66 GB/s
  sms=64, NVL_SEND=16, NVL_RECV=512 -> d_recv=57.75, c_recv=150.46 GB/s

d_recv (actual NVL throughput per rank) plateaus at ~57 GB/s for topk>=2;
combine recv scales near-linearly with num_sms.
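To put the two sweep rows side by side, the sketch below computes the ratios between them. It uses only the figures quoted in the commit message; the dictionary layout and variable names are illustrative assumptions, and with just two data points it is plain arithmetic, not a scaling model.

```python
# Figures copied from the commit message (GB/s); the dict layout is an assumption.
sweep = [
    {"sms": 20, "nvl_send": 8, "nvl_recv": 256, "d_recv": 50.76, "c_recv": 65.66},
    {"sms": 64, "nvl_send": 16, "nvl_recv": 512, "d_recv": 57.75, "c_recv": 150.46},
]

base, tuned = sweep

# Ratio of the tuned configuration to the default one for each quantity.
sm_ratio = tuned["sms"] / base["sms"]
d_speedup = tuned["d_recv"] / base["d_recv"]
c_speedup = tuned["c_recv"] / base["c_recv"]

print(f"num_sms ratio:     {sm_ratio:.2f}x")
print(f"dispatch recv:     {d_speedup:.2f}x")
print(f"combine recv:      {c_speedup:.2f}x")
```

Note that the second row changes all three parameters at once, so these ratios cannot separate the effect of num_sms from that of the NVL chunk sizes.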
@@ -106,7 +106,9 @@ def main():
     # Allocate Buffer (intranode only: num_rdma_bytes=0). Size the NVL buffer
     # using max(hidden, bench_hidden) so the optional bench phase fits.
-    cfg = ep.Config(20, 8, 256)
+    cfg = ep.Config(int(os.environ.get("MSCCLPP_EP_NUM_SMS", "20")),
+                    int(os.environ.get("MSCCLPP_EP_NVL_SEND", "8")),
+                    int(os.environ.get("MSCCLPP_EP_NVL_RECV", "256")))
     _bench_on = os.environ.get("MSCCLPP_EP_BENCH", "0") == "1"
     _buf_hidden = max(hidden, int(os.environ.get("MSCCLPP_EP_BENCH_HIDDEN", "0"))) if _bench_on else hidden
     num_nvl_bytes = cfg.get_nvl_buffer_size_hint(_buf_hidden * x.element_size(), num_ranks)
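The change above reads each override with a bare int(os.environ.get(...)), which raises a generic ValueError if a variable is set to a non-integer or an empty string. A minimal sketch of a stricter reader follows. The helper names env_int and config_args are hypothetical and not part of the patch, and both the positivity check and the fallback to the default on an empty value are assumptions that differ from the patch's behavior.

```python
import os


def env_int(name, default, environ=None):
    """Hypothetical helper: read a positive integer override, else `default`."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    # Assumption: an unset or empty variable means "use the default".
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    # Assumption: SM counts and chunk sizes are only meaningful when positive.
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")
    return value


def config_args(environ=None):
    """Return the three ep.Config arguments, using the commit's defaults."""
    return (
        env_int("MSCCLPP_EP_NUM_SMS", 20, environ),
        env_int("MSCCLPP_EP_NVL_SEND", 8, environ),
        env_int("MSCCLPP_EP_NVL_RECV", 256, environ),
    )
```

With such a helper the call site would read cfg = ep.Config(*config_args()), and the second sweep row corresponds to setting MSCCLPP_EP_NUM_SMS=64, MSCCLPP_EP_NVL_SEND=16, and MSCCLPP_EP_NVL_RECV=512 before launching the test.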