Qinghua Zhou 453160cc06 src/ext/ep: port low-latency dispatch/combine kernels
Port DeepEP's pure-RDMA low-latency (LL) MoE kernels from
csrc/kernels/internode_ll.cu (branch chhwang/dev-atomic-add-cleanup)
into the MSCCL++ EP extension. NVSHMEM / IBGDA device primitives are
replaced with MSCCL++ PortChannelDeviceHandle operations:

  nvshmemx_barrier_all_block()            -> port-channel signal+wait ring
  nvshmemi_ibgda_put_nbi_warp(...)        -> lane-0 PortChannel.put(...)
  nvshmemi_ibgda_amo_nonfetch_add(...)    -> lane-0 PortChannel.atomicAdd(...)

The atomicAdd path relies on the MSCCL++ Connection::atomicAdd /
PortChannelDeviceHandle::atomicAdd API cherry-picked from branch
chhwang/new-atomic-add; the LL dispatch path uses a signed delta
(-num_tokens_sent - 1) which the new int64_t signature supports.

Changes:
* New file src/ext/ep/kernels/internode_ll.cu (~530 lines) with the
  three kernels clean_low_latency_buffer, dispatch<kUseFP8,...>,
  combine<...> plus their launchers. rdma_buffer_ptr is threaded
  through the launchers so the kernel can translate virtual addresses
  into registered-memory offsets expected by MSCCL++.
* kernels/api.cuh: replace the single stub signature with full LL
  launcher prototypes.
* buffer.cc: replace the four LL throw-stubs
  (clean_low_latency_buffer, low_latency_dispatch,
  low_latency_combine, get_next_low_latency_combine_buffer) with
  torch-Tensor implementations ported from DeepEP/csrc/deep_ep.cpp.
* Drop src/ext/ep/internode_stub.cc and its CMake entry.
* python/mscclpp/ext/ep/buffer.py: remove the low_latency_mode=True
  NotImplementedError guard; update docstring.
* test/python/ext/ep/test_ep_smoke.py: rename
  test_low_latency_rejected -> test_low_latency_buffer_construct
  to reflect that LL construction is now accepted.
* src/ext/ep/README.md: update status matrix, document the
  NVSHMEM -> MSCCL++ translation table, and list the known
  limitations.

This is a structural port: the kernels compile, link, and pass the
single-rank smoke tests, but end-to-end behaviour on multi-node H100
is not yet validated. Two known caveats:

  1. Performance will NOT match IBGDA because MSCCL++ port channels
     use a CPU proxy; this port is for functional parity, not latency.
  2. Buffer::sync() in LL mode only connects peers that share the
     same local GPU id (DeepEP convention), so the LL kernels assume
     a one-GPU-per-node topology (num_ranks == num_rdma_ranks).
     Multi-GPU-per-node LL layouts will need a follow-up in sync().

Tested:
  cmake --build build -j --target mscclpp_ep_cpp   # builds clean
  pytest test/python/ext/ep/test_ep_smoke.py        # 3 passed
2026-04-20 21:46:00 +00:00

MSCCL++ C++ Test Framework

A lightweight, GTest-like test framework with MPI support for testing MSCCL++ C++ APIs. It is defined in framework.hpp and framework.cc.

Adding a New Test (Step-by-Step)

Single-process test (unit/)

  1. Create the test file test/unit/my_feature_tests.cc (or .cu for CUDA):

    #include "../framework.hpp"
    #include <mscclpp/my_feature.hpp>
    
    TEST(MyFeatureTest, BasicUsage) {
      EXPECT_EQ(myFunction(), 42);
    }
    
  2. Register it in CMake — add the filename to test/unit/CMakeLists.txt:

    target_sources(unit_tests PRIVATE
        ...
        my_feature_tests.cc   # <-- add here
    )
    
  3. Build and run:

    cmake --build build -j
    ./build/test/unit_tests --filter=MyFeatureTest
    

Multi-process test (mp_unit/)

  1. Create the test file test/mp_unit/my_feature_tests.cc (or .cu):

    #include "mp_unit_tests.hpp"
    
    TEST(MyFeatureTest, MultiRank) {
      int rank = gEnv->rank;
      EXPECT_GE(rank, 0);
    }
    

    Use fixtures from mp_unit_tests.hpp (e.g., CommunicatorTest) if you need pre-established connections.

  2. Register it in CMake — add the filename to test/mp_unit/CMakeLists.txt:

    target_sources(mp_unit_tests PRIVATE
        ...
        my_feature_tests.cc   # <-- add here
    )
    
  3. Build and run:

    cmake --build build -j
    mpirun -np 2 ./build/test/mp_unit_tests --filter=MyFeatureTest
    

Notes

  • No registration step is needed in code: TEST() auto-registers via static initialization. The CMake entry is the only thing to add.
  • The test_framework static library is built from framework.cc in the top-level test/CMakeLists.txt and linked into both unit_tests and mp_unit_tests. You do not need to modify it.
  • Use .cu extension for files that contain CUDA kernel code; use .cc for host-only tests.
  • Each test binary needs a main() that calls RUN_ALL_TESTS(). See unit/unit_tests_main.cc (single-process) and mp_unit/mp_unit_tests.cc (multi-process with Environment setup).
  • Additional run options: --filter=-Pattern (exclude), --exclude-perf-tests (skip PERF_TESTs).
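
The auto-registration note can be illustrated with a minimal, self-contained sketch. It does not reproduce framework.hpp; SketchRegistry, SKETCH_TEST, and runAllSketchTests are hypothetical names chosen for illustration, and the real macro expansion may differ. The idea is that the macro defines a static variable whose initializer adds the test to a registry before main() runs.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a test registry; not the MSCCL++ implementation.
class SketchRegistry {
 public:
  using Entry = std::pair<std::string, std::function<void()>>;

  static SketchRegistry& instance() {
    static SketchRegistry registry;  // constructed on first use
    return registry;
  }

  // Returns a value so the call can be used as a static initializer.
  bool add(std::string name, std::function<void()> body) {
    tests_.emplace_back(std::move(name), std::move(body));
    return true;
  }

  const std::vector<Entry>& tests() const { return tests_; }

 private:
  std::vector<Entry> tests_;
};

// Declares the body, registers it from a static initializer, then opens the body.
#define SKETCH_TEST(suite, name)                                               \
  static void suite##_##name##_body();                                         \
  [[maybe_unused]] static const bool suite##_##name##_registered =             \
      SketchRegistry::instance().add(#suite "." #name, suite##_##name##_body); \
  static void suite##_##name##_body()

static int gCalls = 0;  // counts how many test bodies actually ran

SKETCH_TEST(MyFeatureTest, BasicUsage) { ++gCalls; }
SKETCH_TEST(MyFeatureTest, SecondCase) { ++gCalls; }

// Runs every registered test and returns how many were executed.
int runAllSketchTests() {
  int ran = 0;
  for (const auto& test : SketchRegistry::instance().tests()) {
    test.second();
    ++ran;
  }
  return ran;
}
```

Calling runAllSketchTests() from main() runs both bodies without either test being listed anywhere by hand, which is why adding a source file to the CMake target is enough.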

Macros

| Macro | Behavior |
| --- | --- |
| TEST(Suite, Name) | Register a test. If Suite is a defined class, it's used as a fixture. |
| PERF_TEST(Suite, Name) | Same as TEST but marked as perf (skippable via --exclude-perf-tests). |
| EXPECT_* | Non-fatal assertions: EXPECT_TRUE, EXPECT_FALSE, EXPECT_EQ, EXPECT_NE, EXPECT_LT, EXPECT_LE, EXPECT_GT, EXPECT_GE |
| ASSERT_* | Fatal assertions (abort test on failure): same variants as EXPECT_*, plus ASSERT_NO_THROW |
| FAIL() | Fail immediately. Supports streaming: FAIL() << "reason"; |
| SKIP_TEST() | Skip the current test. Supports streaming: SKIP_TEST() << "reason"; |
| CUDA_CHECK(call) | Check a CUDA API return code, throw on error. |
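
The difference between non-fatal and fatal assertions is easiest to see side by side. The sketch below is a stand-in, not the framework's macros: SKETCH_EXPECT_EQ and SKETCH_ASSERT_EQ are hypothetical, and aborting the test with a plain return is an assumption (the real ASSERT_* may use another mechanism, such as an exception).

```cpp
#include <iostream>

static int gFailures = 0;     // failures recorded by either macro
static int gStepsReached = 0; // statements reached after a failed check

// Non-fatal: record the failure and keep executing the test body.
#define SKETCH_EXPECT_EQ(a, b)                                 \
  do {                                                         \
    if (!((a) == (b))) {                                       \
      ++gFailures;                                             \
      std::cerr << "EXPECT failed: " #a " == " #b << "\n";     \
    }                                                          \
  } while (0)

// Fatal: record the failure and leave the test body immediately.
#define SKETCH_ASSERT_EQ(a, b)                                 \
  do {                                                         \
    if (!((a) == (b))) {                                       \
      ++gFailures;                                             \
      std::cerr << "ASSERT failed: " #a " == " #b << "\n";     \
      return;                                                  \
    }                                                          \
  } while (0)

void nonFatalExample() {
  SKETCH_EXPECT_EQ(1, 2);  // fails, but execution continues
  ++gStepsReached;         // reached
}

void fatalExample() {
  SKETCH_ASSERT_EQ(1, 2);  // fails and returns from the function
  ++gStepsReached;         // not reached
}
```

A common rule of thumb is to use the fatal form when later statements would be meaningless or unsafe after the failure (for example, a null pointer), and the non-fatal form otherwise so that one run reports as many problems as possible.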

Fixtures

Define a class inheriting from mscclpp::test::TestCase with SetUp() / TearDown(), then use the class name as the suite name:

class MyFixture : public mscclpp::test::TestCase {
 public:
  void SetUp() override { /* per-test setup */ }
  void TearDown() override { /* per-test cleanup */ }
 protected:
  int sharedState_ = 0;
};

TEST(MyFixture, SomeTest) {
  sharedState_ = 42;
  EXPECT_EQ(sharedState_, 42);
}

See mp_unit/mp_unit_tests.hpp (BootstrapTest, CommunicatorTest, etc.) for real fixture examples.
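
The test body above can touch the protected member sharedState_, which suggests the body runs as a member of a class derived from the fixture. The following self-contained sketch shows one way that could work. It is an assumption modeled on the GTest convention, not the actual expansion of TEST(); SketchTestCase, MyFixture_SomeTest, and runOne are hypothetical names.

```cpp
#include <string>
#include <vector>

// Hypothetical base class standing in for mscclpp::test::TestCase.
struct SketchTestCase {
  virtual ~SketchTestCase() = default;
  virtual void SetUp() {}
  virtual void TearDown() {}
  virtual void TestBody() = 0;
};

static std::vector<std::string> gEvents;  // records the call order
static int gObserved = 0;                 // value seen by the test body

class MyFixture : public SketchTestCase {
 public:
  void SetUp() override {
    gEvents.push_back("SetUp");
    sharedState_ = 1;
  }
  void TearDown() override { gEvents.push_back("TearDown"); }

 protected:
  int sharedState_ = 0;
};

// Roughly what a TEST(MyFixture, SomeTest) expansion could generate (assumed).
class MyFixture_SomeTest : public MyFixture {
 public:
  void TestBody() override {
    gEvents.push_back("Body");
    sharedState_ += 41;  // protected member is visible in the derived class
    gObserved = sharedState_;
  }
};

// Runs one test on a fresh fixture instance: SetUp, body, TearDown.
template <typename T>
void runOne() {
  T test;
  test.SetUp();
  test.TestBody();
  test.TearDown();
}
```

Because runOne() constructs a new object per test in this sketch, state set in one test does not leak into the next; whether the framework also creates a fresh instance per test should be checked in framework.cc.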

Global Environments

Register an Environment subclass for one-time global setup/teardown (e.g., MPI bootstrap):

class MyEnv : public mscclpp::test::Environment {
 public:
  void SetUp() override { /* global init */ }
  void TearDown() override { /* global cleanup */ }
};

// In main(), before RUN_ALL_TESTS():
mscclpp::test::TestRegistry::instance().addEnvironment(new MyEnv());

See mp_unit/mp_unit_tests.cc for the MultiProcessTestEnv example.
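
A small sketch of how global environments might bracket a test run is shown below. SketchEnvironment, SketchRunner, and NamedEnv are hypothetical; taking ownership of the raw pointer and tearing down in reverse registration order are assumptions borrowed from GTest, not confirmed behavior of TestRegistry.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for mscclpp::test::Environment.
struct SketchEnvironment {
  virtual ~SketchEnvironment() = default;
  virtual void SetUp() {}
  virtual void TearDown() {}
};

static std::vector<std::string> gLog;  // records the lifecycle order

class SketchRunner {
 public:
  // Assumed: the runner takes ownership, matching the addEnvironment(new MyEnv()) call shape.
  void addEnvironment(SketchEnvironment* env) { envs_.emplace_back(env); }

  void run() {
    for (auto& env : envs_) env->SetUp();  // once, before any test
    gLog.push_back("tests");               // placeholder for running all tests
    for (auto it = envs_.rbegin(); it != envs_.rend(); ++it) {
      (*it)->TearDown();                   // once, after all tests, reverse order (assumed)
    }
  }

 private:
  std::vector<std::unique_ptr<SketchEnvironment>> envs_;
};

struct NamedEnv : SketchEnvironment {
  explicit NamedEnv(std::string n) : name(std::move(n)) {}
  void SetUp() override { gLog.push_back(name + ":up"); }
  void TearDown() override { gLog.push_back(name + ":down"); }
  std::string name;
};

// Registers two environments, runs once, and returns the recorded order.
std::vector<std::string> runSketch() {
  gLog.clear();
  SketchRunner runner;
  runner.addEnvironment(new NamedEnv("mpi"));
  runner.addEnvironment(new NamedEnv("gpu"));
  runner.run();
  return gLog;
}
```

The point of the pattern is that expensive, process-wide setup such as an MPI bootstrap happens exactly once, rather than in every fixture's SetUp().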

Utilities

  • mscclpp::test::utils::isMainRank() — true on MPI rank 0
  • mscclpp::test::utils::getMPIRank() / getMPISize()
  • mscclpp::test::utils::Timer — high-resolution timer with start(), stop(), elapsedMilliseconds()
  • mscclpp::test::currentTestName() — returns "Suite.Name" for the running test