Commit Graph

632 Commits

Author SHA1 Message Date
caiomcbr
7493e2f075 Double buffering for NCCL APIs (#324)
Using two scratch buffers in each peer to exchange data.

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-07-15 22:18:53 +00:00
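As a rough illustration of the double-buffering idea in the commit above, the sketch below alternates between two scratch buffers, so that a new call can write into one buffer while data from the previous call may still be read from the other. The `DoubleBuffer` class and its `acquire` method are hypothetical names, not taken from the MSCCL++ source; the real change works on GPU scratch memory shared between peers, which this host-side sketch does not model.

```python
class DoubleBuffer:
    """Hypothetical helper: hands out two scratch buffers in alternation."""

    def __init__(self, size: int) -> None:
        # Two independent scratch areas of the same size.
        self._buffers = [bytearray(size), bytearray(size)]
        self._turn = 0  # index of the buffer the next call will use

    def acquire(self) -> bytearray:
        """Return the buffer for the current call and flip to the other one."""
        buf = self._buffers[self._turn]
        self._turn ^= 1
        return buf


scratch = DoubleBuffer(size=16)
first = scratch.acquire()   # call 0 uses buffer 0
second = scratch.acquire()  # call 1 uses buffer 1
third = scratch.acquire()   # call 2 reuses buffer 0
```

Consecutive calls never share a buffer, which is the property that lets back-to-back collectives overlap without overwriting data still in flight.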
Binyang Li
5f9ee27aa8 Support to write packets via uint2 (#327) 2024-07-15 12:05:13 -07:00
Changho Hwang
c4ca2fbc8c Resolve clang++ warnings (#325) 2024-07-11 07:48:35 +00:00
caiomcbr
f4c3c8f916 AllReduce Kernel for Small Messages (#322)
Add an allreduce kernel for message sizes smaller than 32 bytes,
where the number of elements is smaller than the number of ranks.

---------

Co-authored-by: Caio Rocha <caio.rocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-07-05 21:08:43 +00:00
Ziyue Yang
b5a48f836c Separate NPKit CPU timestamp access from different blocks for AMD platform (#321)
Reference: https://github.com/ROCm/rccl/pull/1229
2024-07-02 19:36:48 +08:00
Angelica Moreira
0f796bbdf7 Update allreduce_bench.py (#318)
Replace the hardcoded network interface name with a generic discovery
strategy.

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-06-29 03:41:13 +00:00
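The commit above does not say which discovery strategy `allreduce_bench.py` uses. One plausible approach, sketched here purely as an assumption, is to enumerate the host's interfaces with the standard-library call `socket.if_nameindex()` and take the first one that is not the loopback device. The function name `pick_interface` and the `"lo"` filter are hypothetical.

```python
import socket
from typing import Iterable, Optional


def pick_interface(names: Optional[Iterable[str]] = None) -> Optional[str]:
    """Return the first non-loopback interface name, or None if there is none.

    `names` can be passed explicitly (useful for testing); otherwise the
    interfaces are read from the operating system.
    """
    if names is None:
        # socket.if_nameindex() yields (index, name) pairs.
        names = [name for _index, name in socket.if_nameindex()]
    for name in names:
        if name != "lo":  # assumption: "lo" is the loopback device (Linux)
            return name
    return None
```

Passing the candidate names in explicitly keeps the selection logic testable on machines whose interface layout differs from the benchmark hosts.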
caiomcbr
b1b9d0626c Support NCCL APIs (#319)
Start supporting NCCL APIs with a few limitations.

---------

Co-authored-by: Caio Rocha <caio.rocha@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-06-27 23:54:06 +00:00
Roshan Dathathri
91550dab4c Simplify/improve barrier in AllReduce6 (#317)
Drop superfluous __threadfence_system()
2024-06-23 21:08:59 +00:00
Angelica Moreira
34f4d9d006 Update quickstart.md (#314)
Updating the docker image name tag and the python benchmark path.
2024-06-19 22:26:13 +00:00
Roshan Dathathri
93ed8e1e58 Add support for multicast reduce instruction (#316) 2024-06-19 13:28:12 -07:00
Binyang Li
1351f9f1c5 Add "packet type" option for executor test (#313)
2024-06-14 09:53:58 +00:00
Ziyue Yang
f29095b3b1 Fix NPKit support for AMD (#312) 2024-06-14 16:22:14 +08:00
Ziyue Yang
76328fe623 Add NPKit GPU event support (#310) 2024-06-13 13:59:50 +08:00
Binyang Li
80aefe55bc Cumulative Updates (#309)
- Bug fix: Unable to execute communication primitives with the same
  execution plan but varying message sizes.
- Add reduce_packets OP
2024-06-12 19:17:57 +08:00
Changho Hwang
1f62dfd7cd Add C++ executor test (#304)
- Add C++ executor test
- Fix executor bugs for packet operation
- Enhance executor_test.py

---------

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-05-29 10:54:36 +00:00
Changho Hwang
cddffbc8b6 v0.5.1 (#308) v0.5.1 2024-05-26 14:31:29 -07:00
Binyang Li
3a18068cd4 Fix security issue (#305)
Change sprintf to snprintf to avoid potential security issue
2024-05-25 23:12:57 -07:00
Changho Hwang
f76eae4dca Fix assert declaration & add a compile test (#303) 2024-05-20 02:39:30 +00:00
Changho Hwang
d35a2f2dc2 Rename executor.cpp to executor_py.cpp (#301) 2024-05-17 13:31:27 -07:00
Changho Hwang
a3cd95bd42 Upgrade gtest (#300)
The new gtest version resolves a type casting issue:
3044657e7a
2024-05-07 20:49:26 -07:00
Changho Hwang
9c2a96060a v0.5.0 (#298) v0.5.0 2024-05-04 16:51:48 -07:00
aashaka
0650371b54 Allow obtaining cuda stream handle from PyTorch stream when launching kernel (#297)
Use `cuda_stream` attribute of a torch stream if the stream is not an
instance of the cupy stream.
2024-05-04 04:57:07 +00:00
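The dispatch described in the commit above can be sketched as follows. CuPy streams expose their raw handle as `ptr`, while PyTorch streams expose it as `cuda_stream`. To keep the example runnable without a GPU, `FakeCupyStream` and `FakeTorchStream` are hypothetical stand-ins for `cupy.cuda.Stream` and `torch.cuda.Stream`, and `stream_handle` is an illustrative name rather than the function used in MSCCL++.

```python
class FakeCupyStream:
    """Stand-in for cupy.cuda.Stream, which exposes the raw handle as `ptr`."""

    def __init__(self, ptr: int) -> None:
        self.ptr = ptr


class FakeTorchStream:
    """Stand-in for torch.cuda.Stream, which exposes it as `cuda_stream`."""

    def __init__(self, handle: int) -> None:
        self.cuda_stream = handle


def stream_handle(stream, cupy_stream_type=FakeCupyStream) -> int:
    """Return the raw CUDA stream handle for a CuPy-like or torch-like stream."""
    if isinstance(stream, cupy_stream_type):
        return stream.ptr
    # Not a CuPy stream: fall back to the PyTorch attribute.
    return stream.cuda_stream
```

In real code, `cupy_stream_type` would be `cupy.cuda.Stream`, and the fallback branch would receive a `torch.cuda.Stream`.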
Binyang Li
6226556ce2 Optimized the execution kernel (#294) 2024-05-03 11:54:50 -07:00
Binyang Li
fc977ce5dd Move pipeline to Azure org (#296)
Move the multi-nodes pipeline to the Azure org to meet the compliance
requirements. Remove the default value for BASE_IMAGE. Using a
third-party registry directly in the Dockerfile is not allowed.
2024-04-29 11:54:34 +08:00
Binyang Li
5628362715 Resolve multi-nodes test failure issue (#295)
Fix bug, resolve multi-nodes test failure issue.
2024-04-26 13:06:57 +08:00
Changho Hwang
d4ede480f4 Ethernet support (#284)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Caio Rocha <caiorocha@microsoft.com>
2024-04-25 11:06:43 -07:00
Changho Hwang
89896ff94f Include GPU data types only for kernel code (#292) 2024-04-24 20:55:02 -07:00
Changho Hwang
6c1fa5307c Refactoring NVLS interfaces (#293)
Move NVLS details from the core to a separate interface
2024-04-24 10:05:41 -07:00
Changho Hwang
9934c982a8 Separate headers for GPU data types (#291)
Prevent unnecessarily including data type headers everywhere.
2024-04-19 05:52:43 +00:00
Roshan Dathathri
41e0964d93 Allow binding allocated memory to NVLS multicast pointer (#290)
And change NVLS multimem instructions to static functions
2024-04-18 17:11:31 -07:00
Binyang Li
64d837f9ab Add executor to execute schedule-plan file (#283)
Add executor to execute the JSON schedule file generated by msccl-tools

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-04-18 19:10:41 +00:00
Changho Hwang
9406123711 Fix a typo name (#286) 2024-04-17 23:45:46 +00:00
Changho Hwang
1a7cb98e3a v0.4.3 (#279) v0.4.3 2024-03-27 11:53:09 -07:00
Changho Hwang
5ba6ce00c7 Fix bootstrapping mechanism (#278)
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>
2024-03-27 10:24:24 +08:00
Binyang Li
bc465aefcd Add __launch_bounds__ for mscclpp-test (#273) 2024-03-25 15:55:37 -07:00
Binyang Li
4734d8718f Fix multi-node ci pipeline (#272)
Add `__launch_bounds__` to fix perf regression issue in CI pipeline
2024-03-12 09:39:00 -07:00
Changho Hwang
cdaf3aea3d New packet format & optimizations (#256)
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-02-20 20:01:37 -08:00
Saeed Maleki
a3d0799963 Fix the comm.py for nvls (#267)
2024-02-19 10:39:21 +08:00
Binyang Li
5971508eed Remove cuda-python from project (#245)
Remove cuda-python and use CuPy APIs instead

---------

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-13 21:44:11 +08:00
aashaka
d97fef4395 Allow semaphores and memory to be registered separately in ProxyService (#264)
This is needed in use cases where SimpleProxyChannel does not suffice.
For example, when a single semaphore is to be used for multiple tensors
or when multiple semaphores should be associated with a tensor.
2024-02-08 09:55:29 -08:00
Binyang Li
7c229fbdd8 Fix multi-nodes test failure (#262)
fix multi-nodes CI pipeline

Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-07 18:21:05 -08:00
aashaka
2101f5251e Allow MSCCL++ CommGroup to take PyTorch tensors in args (#255)
Obtain data_ptr and tensor_size accordingly for torch.Tensor

Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-02-06 19:47:25 -08:00
Changho Hwang
6a19b19ece Fix NVLS support (#258)
* Do not compile nvls_test with ROCm
* Fix multi-node tests
2024-02-06 23:24:13 +00:00
Changho Hwang
d34e097b40 Fix wrong offset calculation (#257) 2024-02-06 08:55:43 +08:00
Saeed Maleki
91d592dcc0 NVLS support. (#250)
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-04 20:46:10 -08:00
Changho Hwang
4eb0a08b8c Add putWithSignal() latency tests (#246) 2024-01-24 01:10:35 +00:00
aashaka
6c9d159e85 Increase MSCCLPP_BITS_REGMEM_HANDLE to 9 (#251)
MSCCLPP_BITS_REGMEM_HANDLE=8 limits the number of registered memories for a
ProxyService to 256. Many use cases, such as KV cache transfer, require
registering more tensors.

This change allows registering up to 512 memories. Note that this change
uses up the slack bits remaining in the ChannelTrigger struct.
2024-01-23 13:38:33 -08:00
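The limits quoted in the commit above follow from the width of the handle field: an n-bit handle can name 2^n distinct registered memories. A minimal check of that arithmetic, with `max_registered_memories` as a hypothetical helper name:

```python
def max_registered_memories(handle_bits: int) -> int:
    """Number of distinct IDs representable in a field of `handle_bits` bits."""
    return 1 << handle_bits


old_limit = max_registered_memories(8)  # 256
new_limit = max_registered_memories(9)  # 512
```

Each extra bit doubles the limit, which is why a single additional bit is enough to go from 256 to 512.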
Binyang Li
422c81f0f8 remove make pylib-copy command (#249)
Fix #216
Remove `make pylib-copy`
2024-01-19 12:29:15 -08:00
Changho Hwang
1178a9ee47 Minor improvement on device syncer (#231)
Saves 1 to 2 us in some tests.
2024-01-15 17:33:59 -08:00
Changho Hwang
c0fe31fa76 Mask each field of the trigger (#244)
The behavior of `ProxyChannelDeviceHandle::put()` is undefined by design
when a field value exceeds its bit limit (such as
`MSCCLPP_BITS_SIZE`). Even so, it is safer to trim the excess bits of each
field value, so that invalid usage of one field does not
propagate to other fields.
2024-01-10 19:31:47 -08:00
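The effect of masking can be seen in a small bit-packing sketch. This is not the actual trigger layout: the two 4-bit fields and the function names `pack_fields` and `unpack_fields` are assumptions chosen to keep the numbers easy to follow.

```python
def pack_fields(fields, mask=True):
    """Pack (value, bits) pairs into one integer, least-significant field first."""
    word = 0
    shift = 0
    for value, bits in fields:
        if mask:
            value &= (1 << bits) - 1  # trim bits that exceed the field width
        word |= value << shift
        shift += bits
    return word


def unpack_fields(word, widths):
    """Split a packed integer back into fields of the given bit widths."""
    values = []
    for bits in widths:
        values.append(word & ((1 << bits) - 1))
        word >>= bits
    return values


# The first value (0x1F) needs 5 bits but its field is only 4 bits wide.
fields = [(0x1F, 4), (0x2, 4)]
unmasked = unpack_fields(pack_fields(fields, mask=False), [4, 4])  # [0xF, 0x3]
masked = unpack_fields(pack_fields(fields, mask=True), [4, 4])     # [0xF, 0x2]
```

Without masking, the overflowing bit of the first field leaks into the second and changes it from 0x2 to 0x3. With masking, the first field is still truncated but the second keeps its value.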