caiomcbr
f4c3c8f916
AllReduce Kernel for Small Messages ( #322 )
...
Adding allreduce kernel code for message sizes smaller than 32 bytes,
when the number of elements is smaller than the number of ranks.
---------
Co-authored-by: Caio Rocha <caio.rocha@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-07-05 21:08:43 +00:00
Ziyue Yang
b5a48f836c
Separate NPKit CPU timestamp access from different blocks for AMD platform ( #321 )
...
Reference: https://github.com/ROCm/rccl/pull/1229
2024-07-02 19:36:48 +08:00
Angelica Moreira
0f796bbdf7
Update allreduce_bench.py ( #318 )
...
Replacing the hardcoded network interface name with a generic discovery
strategy.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-06-29 03:41:13 +00:00
caiomcbr
b1b9d0626c
Support NCCL APIs ( #319 )
...
Start supporting NCCL APIs with a few limitations.
---------
Co-authored-by: Caio Rocha <caio.rocha@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-06-27 23:54:06 +00:00
Roshan Dathathri
91550dab4c
Simplify/improve barrier in AllReduce6 ( #317 )
...
Drop superfluous __threadfence_system()
2024-06-23 21:08:59 +00:00
Angelica Moreira
34f4d9d006
Update quickstart.md ( #314 )
...
Updating the docker image name tag and the python benchmark path.
2024-06-19 22:26:13 +00:00
Roshan Dathathri
93ed8e1e58
Add support for multicast reduce instruction ( #316 )
2024-06-19 13:28:12 -07:00
Binyang Li
1351f9f1c5
Add "packet type" option for executor test ( #313 )
...
Add "packet type" option for executor test
2024-06-14 09:53:58 +00:00
Ziyue Yang
f29095b3b1
Fix NPKit support for AMD ( #312 )
2024-06-14 16:22:14 +08:00
Ziyue Yang
76328fe623
Add NPKit GPU event support ( #310 )
2024-06-13 13:59:50 +08:00
Binyang Li
80aefe55bc
Cumulative Updates ( #309 )
...
Bug fix: Unable to execute communication primitives with the same
execution plan but varying message sizes.
Add reduce_packets OP
2024-06-12 19:17:57 +08:00
Changho Hwang
1f62dfd7cd
Add C++ executor test ( #304 )
...
- Add C++ executor test
- Fix executor bugs for packet operation
- Enhance executor_test.py
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-05-29 10:54:36 +00:00
Changho Hwang
cddffbc8b6
v0.5.1 ( #308 )
v0.5.1
2024-05-26 14:31:29 -07:00
Binyang Li
3a18068cd4
Fix security issue ( #305 )
...
Change sprintf to snprintf to avoid potential security issue
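A minimal sketch of why the bounded call is safer, assuming a hypothetical helper `formatRankLabel` that is not part of the project: `snprintf` never writes past the given buffer size and reports the length the full string would have needed, so truncation can be detected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper for illustration: formats "rank <n>" into a
// caller-provided buffer. Returns true if the whole string fit and
// false if it had to be truncated.
bool formatRankLabel(char* buf, std::size_t bufSize, int rank) {
  // snprintf writes at most bufSize bytes, including the terminating '\0',
  // and returns the length the untruncated string would have had.
  int needed = std::snprintf(buf, bufSize, "rank %d", rank);
  return needed >= 0 && static_cast<std::size_t>(needed) < bufSize;
}
```

With `sprintf`, the second case in a 4-byte buffer would overflow; here the output is cut to fit and the caller is told about it.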
2024-05-25 23:12:57 -07:00
Changho Hwang
f76eae4dca
Fix assert declaration & add a compile test ( #303 )
2024-05-20 02:39:30 +00:00
Changho Hwang
d35a2f2dc2
Rename executor.cpp to executor_py.cpp ( #301 )
2024-05-17 13:31:27 -07:00
Changho Hwang
a3cd95bd42
Upgrade gtest ( #300 )
...
The new gtest version resolves a type casting issue:
3044657e7a
2024-05-07 20:49:26 -07:00
Changho Hwang
9c2a96060a
v0.5.0 ( #298 )
v0.5.0
2024-05-04 16:51:48 -07:00
aashaka
0650371b54
Allow obtaining cuda stream handle from PyTorch stream when launching kernel ( #297 )
...
Use `cuda_stream` attribute of a torch stream if the stream is not an
instance of the cupy stream.
2024-05-04 04:57:07 +00:00
Binyang Li
6226556ce2
Optimized the execution kernel ( #294 )
2024-05-03 11:54:50 -07:00
Binyang Li
fc977ce5dd
Move pipeline to Azure org ( #296 )
...
Move the multi-node pipeline to the Azure org to meet compliance
requirements. Remove the default value for BASE_IMAGE, since using a
3rd-party registry directly in the Dockerfile is not allowed.
2024-04-29 11:54:34 +08:00
Binyang Li
5628362715
Resolve multi-nodes test failure issue ( #295 )
...
Fix bug, resolve multi-nodes test failure issue.
2024-04-26 13:06:57 +08:00
Changho Hwang
d4ede480f4
Ethernet support ( #284 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Caio Rocha <caiorocha@microsoft.com >
2024-04-25 11:06:43 -07:00
Changho Hwang
89896ff94f
Include GPU data types only for kernel code ( #292 )
2024-04-24 20:55:02 -07:00
Changho Hwang
6c1fa5307c
Refactoring NVLS interfaces ( #293 )
...
Move NVLS details from the core to a separate interface
2024-04-24 10:05:41 -07:00
Changho Hwang
9934c982a8
Separate headers for GPU data types ( #291 )
...
Avoid unnecessarily including data type headers everywhere.
2024-04-19 05:52:43 +00:00
Roshan Dathathri
41e0964d93
Allow binding allocated memory to NVLS multicast pointer ( #290 )
...
And change NVLS multimem instructions to static functions
2024-04-18 17:11:31 -07:00
Binyang Li
64d837f9ab
Add executor to execute schedule-plan file ( #283 )
...
Add executor to execute the JSON schedule file generated by msccl-tools
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-04-18 19:10:41 +00:00
Changho Hwang
9406123711
Fix a typo name ( #286 )
2024-04-17 23:45:46 +00:00
Changho Hwang
1a7cb98e3a
v0.4.3 ( #279 )
v0.4.3
2024-03-27 11:53:09 -07:00
Changho Hwang
5ba6ce00c7
Fix bootstrapping mechanism ( #278 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com >
2024-03-27 10:24:24 +08:00
Binyang Li
bc465aefcd
Add __launch_bounds__ for mscclpp-test ( #273 )
2024-03-25 15:55:37 -07:00
Binyang Li
4734d8718f
Fix multi-node ci pipeline ( #272 )
...
Add `__launch_bounds__` to fix perf regression issue in CI pipeline
2024-03-12 09:39:00 -07:00
Changho Hwang
cdaf3aea3d
New packet format & optimizations ( #256 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-02-20 20:01:37 -08:00
Saeed Maleki
a3d0799963
Fix the comm.py for nvls ( #267 )
...
Fix the comm.py for nvls
2024-02-19 10:39:21 +08:00
Binyang Li
5971508eed
Remove cuda-python from project ( #245 )
...
Remove cuda-python and use CuPy APIs instead
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-02-13 21:44:11 +08:00
aashaka
d97fef4395
Allow semaphores and memory to be registered separately in ProxyService ( #264 )
...
This is needed in use cases where SimpleProxyChannel does not suffice.
For example, when a single semaphore is to be used for multiple tensors
or when multiple semaphores should be associated with a tensor.
2024-02-08 09:55:29 -08:00
Binyang Li
7c229fbdd8
Fix multi-nodes test failure ( #262 )
...
fix multi-nodes CI pipeline
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-02-07 18:21:05 -08:00
aashaka
2101f5251e
Allow MSCCL++ CommGroup to take PyTorch tensors in args ( #255 )
...
Obtain data_ptr and tensor_size accordingly for torch.Tensor
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-02-06 19:47:25 -08:00
Changho Hwang
6a19b19ece
Fix NVLS support ( #258 )
...
* Do not compile nvls_test with ROCm
* Fix multi-node tests
2024-02-06 23:24:13 +00:00
Changho Hwang
d34e097b40
Fix wrong offset calculation ( #257 )
2024-02-06 08:55:43 +08:00
Saeed Maleki
91d592dcc0
NVLS support. ( #250 )
...
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-02-04 20:46:10 -08:00
Changho Hwang
4eb0a08b8c
Add putWithSignal() latency tests ( #246 )
2024-01-24 01:10:35 +00:00
aashaka
6c9d159e85
Increase MSCCLPP_BITS_REGMEM_HANDLE to 9 ( #251 )
...
MSCCLPP_BITS_REGMEM_HANDLE=8 limits the number of registered memories for a
ProxyService to 256. Many use cases, such as KV cache transfer, require
registering more tensors.
This change allows registering up to 512 memories. Note that this change
uses up the slack bits remaining in the ChannelTrigger struct.
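The arithmetic behind the two limits can be sketched as follows. The function name `maxRegisteredMemories` is hypothetical; only the bit widths 8 and 9 and the limits 256 and 512 come from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Number of distinct handle IDs addressable with a given bit width.
// Hypothetical helper, not the project's API.
constexpr std::uint32_t maxRegisteredMemories(unsigned bits) {
  return std::uint32_t{1} << bits;
}

static_assert(maxRegisteredMemories(8) == 256, "8 bits address 256 handles");
static_assert(maxRegisteredMemories(9) == 512, "9 bits address 512 handles");
```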
2024-01-23 13:38:33 -08:00
Binyang Li
422c81f0f8
remove make pylib-copy command ( #249 )
...
Fix #216
Remove `make pylib-copy`
2024-01-19 12:29:15 -08:00
Changho Hwang
1178a9ee47
Minor improvement on device syncer ( #231 )
...
Saved 1~2us in some tests
2024-01-15 17:33:59 -08:00
Changho Hwang
c0fe31fa76
Mask each field of the trigger ( #244 )
...
The behavior of `ProxyChannelDeviceHandle::put()` is undefined by design
when a field value exceeds its bit limit (such as
`MSCCLPP_BITS_SIZE`). Even so, we'd better trim the excess bits of each
field value for safety, so that invalid usage of one field does not
propagate to other fields.
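A minimal sketch of the trimming idea, assuming a hypothetical two-field layout packed into one word. The widths `kBitsSize` and `kBitsOffset` and the function `packFields` are illustrative only and do not reflect the real trigger layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical field widths for illustration; the real trigger differs.
constexpr unsigned kBitsSize = 8;
constexpr unsigned kBitsOffset = 8;

// Mask with the lowest `bits` bits set (valid for bits < 32).
constexpr std::uint32_t maskBits(unsigned bits) {
  return (std::uint32_t{1} << bits) - 1;
}

// Packs size into the low bits and offset into the next field. Each value
// is masked first, so an oversized size cannot spill into the offset field.
constexpr std::uint32_t packFields(std::uint32_t size, std::uint32_t offset) {
  return (size & maskBits(kBitsSize)) |
         ((offset & maskBits(kBitsOffset)) << kBitsSize);
}
```

For example, `packFields(0x1FF, 0x02)` yields `0x02FF`: the oversized size is trimmed to `0xFF` and the offset stays `0x02`. Without the mask, the stray ninth bit would be OR-ed into the offset field and turn it into `0x03`.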
2024-01-10 19:31:47 -08:00
Binyang Li
163cba08c8
Update interface to let user change fifo size ( #243 )
...
Related to this issue:
https://github.com/microsoft/mscclpp/issues/242. The user may use more
threads than the number specified in `fifo_size` to interact with the
FIFO, which leads to unexpected behavior.
Update the interface to let the user change the FIFO size as needed.
2024-01-09 22:14:36 -08:00
Binyang Li
e7d3e2d44b
Fix crash in static variable destructor ( #238 )
...
According to https://en.cppreference.com/w/cpp/utility/program/exit, `The last destructor for thread-local objects is [sequenced-before](https://en.cppreference.com/w/cpp/language/eval_order) the first destructor for a static object.`
Change the code to avoid this case.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2023-12-25 14:01:28 +00:00
Changho Hwang
70e28b3c76
Do not check value of __HIP_PLATFORM_AMD__ ( #240 )
...
According to the
[document](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_porting_guide.html#compiler-defines-summary),
`__HIP_PLATFORM_AMD__` takes effect simply by being defined, so its value should not be checked.
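A minimal sketch of the resulting check, assuming a hypothetical helper `isHipAmdPlatform` that is not part of the project: the macro is tested with `defined(...)` rather than compared against a value.

```cpp
#include <cassert>

// Hypothetical helper for illustration: reports whether this translation
// unit is being compiled for the AMD HIP platform.
constexpr bool isHipAmdPlatform() {
// Test only whether the macro is defined; do not compare its value,
// e.g. avoid `#if __HIP_PLATFORM_AMD__ == 1`.
#if defined(__HIP_PLATFORM_AMD__)
  return true;
#else
  return false;
#endif
}
```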
2023-12-25 13:51:18 +08:00