mirror of
https://github.com/microsoft/mscclpp.git
synced 2026-05-24 14:54:51 +00:00
At 16 nodes (64 ranks) with topk=8, the expected combine values reach rank*8 = 504 (rank 63), and intermediate partial sums (rank*7 etc.) cross 256, above which the bf16 ulp is 2. With the test pattern x = rank*ones and weights = 1, this produces a deterministic +/-1 round-off on certain ranks (odd local_rank on nodes >= 9), which trips the previous 1e-2 absolute tolerance even though the kernel is correct. Use tol = max(1e-2, max_exp / 64), which matches the bf16 mantissa precision and scales with the magnitude of the expected combined output. The previous tight bound is preserved for small-scale runs where max_exp < 0.64.
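The round-off and the new tolerance can be illustrated with a minimal standalone sketch. It is not the test's actual code: `to_bf16` and `combine_tolerance` are hypothetical helper names, bf16 is emulated by rounding the float32 bit pattern to nearest-even (finite inputs only), and the real kernel's accumulation order may differ.

```python
import struct


def to_bf16(x: float) -> float:
    """Emulate bf16 by rounding a float32 bit pattern to nearest-even.

    Hypothetical helper for illustration; assumes a finite input.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias (0x7FFF plus the lowest kept bit), then drop
    # the low 16 bits, which bf16 does not store.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def combine_tolerance(max_exp: float) -> float:
    """Tolerance rule from the commit message: max(1e-2, max_exp / 64)."""
    return max(1e-2, max_exp / 64)


# Partial sum rank*7 for the last rank (63) is odd and above 256,
# so it is not representable in bf16 and is off by 1 after rounding.
partial = 63 * 7                            # 441
err = abs(to_bf16(float(partial)) - partial)

old_tol = 1e-2
new_tol = combine_tolerance(63 * 8)         # max_exp = 504
print(err, err <= old_tol, err <= new_tol)
```

Under these assumptions the error of 1 exceeds the old 1e-2 bound but stays well inside the scaled tolerance, while an even partial sum such as 62*7 = 434 is represented exactly.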