Changho Hwang
1f62dfd7cd
Add C++ executor test ( #304 )
...
- Add C++ executor test
- Fix executor bugs for packet operation
- Enhance executor_test.py
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-05-29 10:54:36 +00:00
Changho Hwang
cddffbc8b6
v0.5.1 ( #308 )
v0.5.1
2024-05-26 14:31:29 -07:00
Binyang Li
3a18068cd4
Fix security issue ( #305 )
...
Change sprintf to snprintf to avoid a potential security issue
2024-05-25 23:12:57 -07:00
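As a rough illustration of why the sprintf-to-snprintf change in #305 matters, the sketch below formats a label into a caller-supplied buffer and reports truncation. The helper `formatRankLabel` and the `rank-%d` format are hypothetical and are not taken from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the repository): writes "rank-<n>" into buf.
// Returns true only if the whole string fit without truncation.
bool formatRankLabel(char* buf, std::size_t size, int rank) {
  assert(buf != nullptr && size > 0);
  // snprintf writes at most `size` bytes, including the terminating '\0',
  // and returns the length the complete string would have needed.
  int needed = std::snprintf(buf, size, "rank-%d", rank);
  return needed >= 0 && static_cast<std::size_t>(needed) < size;
}
```

With sprintf, the same call on an 8-byte buffer and a 7-digit rank would write past the end of the buffer; with snprintf the output is cut off and the return value lets the caller detect it.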
Changho Hwang
f76eae4dca
Fix assert declaration & add a compile test ( #303 )
2024-05-20 02:39:30 +00:00
Changho Hwang
d35a2f2dc2
Rename executor.cpp to executor_py.cpp ( #301 )
2024-05-17 13:31:27 -07:00
Changho Hwang
a3cd95bd42
Upgrade gtest ( #300 )
...
The new gtest version resolves a type casting issue:
3044657e7a
2024-05-07 20:49:26 -07:00
Changho Hwang
9c2a96060a
v0.5.0 ( #298 )
v0.5.0
2024-05-04 16:51:48 -07:00
aashaka
0650371b54
Allow obtaining cuda stream handle from PyTorch stream when launching kernel ( #297 )
...
Use the `cuda_stream` attribute of a torch stream if the stream is not an
instance of a CuPy stream.
2024-05-04 04:57:07 +00:00
Binyang Li
6226556ce2
Optimized the execution kernel ( #294 )
2024-05-03 11:54:50 -07:00
Binyang Li
fc977ce5dd
Move pipeline to Azure org ( #296 )
...
Move the multi-node pipeline to the Azure org to meet compliance
requirements. Remove the default value for BASE_IMAGE, because using a
third-party registry directly in a Dockerfile is not allowed.
2024-04-29 11:54:34 +08:00
Binyang Li
5628362715
Resolve multi-nodes test failure issue ( #295 )
...
Fix a bug that caused the multi-node tests to fail.
2024-04-26 13:06:57 +08:00
Changho Hwang
d4ede480f4
Ethernet support ( #284 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Caio Rocha <caiorocha@microsoft.com >
2024-04-25 11:06:43 -07:00
Changho Hwang
89896ff94f
Include GPU data types only for kernel code ( #292 )
2024-04-24 20:55:02 -07:00
Changho Hwang
6c1fa5307c
Refactoring NVLS interfaces ( #293 )
...
Move NVLS details from the core to a separate interface
2024-04-24 10:05:41 -07:00
Changho Hwang
9934c982a8
Separate headers for GPU data types ( #291 )
...
Avoid including data type headers where they are not needed.
2024-04-19 05:52:43 +00:00
Roshan Dathathri
41e0964d93
Allow binding allocated memory to NVLS multicast pointer ( #290 )
...
Also change the NVLS multimem instructions to static functions
2024-04-18 17:11:31 -07:00
Binyang Li
64d837f9ab
Add executor to execute schedule-plan file ( #283 )
...
Add executor to execute the JSON schedule file generated by msccl-tools
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-04-18 19:10:41 +00:00
Changho Hwang
9406123711
Fix a typo name ( #286 )
2024-04-17 23:45:46 +00:00
Changho Hwang
1a7cb98e3a
v0.4.3 ( #279 )
v0.4.3
2024-03-27 11:53:09 -07:00
Changho Hwang
5ba6ce00c7
Fix bootstrapping mechanism ( #278 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com >
2024-03-27 10:24:24 +08:00
Binyang Li
bc465aefcd
Add __launch_bounds__ for mscclpp-test ( #273 )
2024-03-25 15:55:37 -07:00
Binyang Li
4734d8718f
Fix multi-node ci pipeline ( #272 )
...
Add `__launch_bounds__` to fix perf regression issue in CI pipeline
2024-03-12 09:39:00 -07:00
Changho Hwang
cdaf3aea3d
New packet format & optimizations ( #256 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-02-20 20:01:37 -08:00
Saeed Maleki
a3d0799963
Fix the comm.py for nvls ( #267 )
...
Fix the comm.py for nvls
2024-02-19 10:39:21 +08:00
Binyang Li
5971508eed
Remove cuda-python from project ( #245 )
...
Remove cuda-python and use CuPy APIs instead
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-02-13 21:44:11 +08:00
aashaka
d97fef4395
Allow semaphores and memory to be registered separately in ProxyService ( #264 )
...
This is needed in use cases where SimpleProxyChannel does not suffice.
For example, when a single semaphore is to be used for multiple tensors
or when multiple semaphores should be associated with a tensor.
2024-02-08 09:55:29 -08:00
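The use case described in #264 can be pictured with a toy registry in which semaphores and memories get independent IDs, so one semaphore can be paired with several memories. This is only a sketch: `ToyProxyService` and all of its members are hypothetical and do not reflect the actual ProxyService interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy model, not the MSCCL++ API: every name here is hypothetical.
using SemaphoreId = uint32_t;
using MemoryId = uint32_t;

struct ToyProxyService {
  std::vector<int> signalCounts;                         // one counter per semaphore
  std::vector<std::pair<void*, std::size_t>> memories;   // (pointer, bytes)

  // Semaphores and memories are registered through separate calls.
  SemaphoreId addSemaphore() {
    signalCounts.push_back(0);
    return static_cast<SemaphoreId>(signalCounts.size() - 1);
  }
  MemoryId addMemory(void* ptr, std::size_t bytes) {
    memories.emplace_back(ptr, bytes);
    return static_cast<MemoryId>(memories.size() - 1);
  }

  // A transfer names its semaphore and its memory independently.
  void putWithSignal(SemaphoreId sem, MemoryId mem) {
    assert(sem < signalCounts.size());
    assert(mem < memories.size());
    ++signalCounts[sem];
  }
};
```

Because the two ID spaces are independent, the pairing is chosen per transfer rather than fixed when a channel is created.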
Binyang Li
7c229fbdd8
Fix multi-nodes test failure ( #262 )
...
Fix the multi-node CI pipeline
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-02-07 18:21:05 -08:00
aashaka
2101f5251e
Allow MSCCL++ CommGroup to take PyTorch tensors in args ( #255 )
...
Obtain data_ptr and tensor_size accordingly for torch.Tensor
Co-authored-by: Binyang Li <binyli@microsoft.com >
2024-02-06 19:47:25 -08:00
Changho Hwang
6a19b19ece
Fix NVLS support ( #258 )
...
* Do not compile nvls_test with ROCm
* Fix multi-node tests
2024-02-06 23:24:13 +00:00
Changho Hwang
d34e097b40
Fix wrong offset calculation ( #257 )
2024-02-06 08:55:43 +08:00
Saeed Maleki
91d592dcc0
NVLS support. ( #250 )
...
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2024-02-04 20:46:10 -08:00
Changho Hwang
4eb0a08b8c
Add putWithSignal() latency tests ( #246 )
2024-01-24 01:10:35 +00:00
aashaka
6c9d159e85
Increase MSCCLPP_BITS_REGMEM_HANDLE to 9 ( #251 )
...
MSCCLPP_BITS_REGMEM_HANDLE=8 limits the number of registered memories for a
ProxyService to 256. Many use cases, such as KV cache transfer, require
registering more tensors.
This change allows registering up to 512 memories. Note that this change
uses up the slack bits remaining in the ChannelTrigger struct.
2024-01-23 13:38:33 -08:00
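The 256 and 512 limits in #251 follow directly from the width of the handle field: an n-bit field can name 2^n distinct memories. A minimal check, with the hypothetical helper names `maxHandles` and `handleFits`:

```cpp
#include <cassert>
#include <cstdint>

// Number of distinct IDs representable in a field of `bits` bits (bits < 64).
constexpr uint64_t maxHandles(unsigned bits) {
  return uint64_t{1} << bits;
}

static_assert(maxHandles(8) == 256, "8 bits address 256 registered memories");
static_assert(maxHandles(9) == 512, "9 bits address 512 registered memories");

// Hypothetical check: does a handle ID fit in a field of `bits` bits?
constexpr bool handleFits(uint64_t id, unsigned bits) {
  return id < maxHandles(bits);
}
```

Handle ID 256 is the first one that needs the ninth bit.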
Binyang Li
422c81f0f8
remove make pylib-copy command ( #249 )
...
Fix #216
Remove `make pylib-copy`
2024-01-19 12:29:15 -08:00
Changho Hwang
1178a9ee47
Minor improvement on device syncer ( #231 )
...
Saves 1-2 us in some tests
2024-01-15 17:33:59 -08:00
Changho Hwang
c0fe31fa76
Mask each field of the trigger ( #244 )
...
The behavior of `ProxyChannelDeviceHandle::put()` is undefined by design
when a field value exceeds its bit limit (such as
`MSCCLPP_BITS_SIZE`). Even so, it is safer to trim the excess bits of each
field value, so that invalid usage of one field does not
propagate to other fields.
2024-01-10 19:31:47 -08:00
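The effect of the masking in #244 can be sketched with two adjacent bit fields packed into one 64-bit word. The widths and names below are hypothetical and chosen only for illustration; the real trigger layout differs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout for illustration only.
constexpr unsigned kBitsSize = 8;    // stand-in for MSCCLPP_BITS_SIZE
constexpr unsigned kBitsOffset = 8;  // hypothetical neighboring field

constexpr uint64_t mask(unsigned bits) { return (uint64_t{1} << bits) - 1; }

// Unmasked packing: an oversized `size` spills into the offset field.
constexpr uint64_t packUnmasked(uint64_t size, uint64_t offset) {
  return size | (offset << kBitsSize);
}

// Masked packing: each field is trimmed to its own width first.
constexpr uint64_t packMasked(uint64_t size, uint64_t offset) {
  return (size & mask(kBitsSize)) | ((offset & mask(kBitsOffset)) << kBitsSize);
}

// Reads the offset field back out of a packed word.
constexpr uint64_t offsetOf(uint64_t packed) {
  return (packed >> kBitsSize) & mask(kBitsOffset);
}
```

With a 9-bit `size` of `0x1FF` and an `offset` of 2, the unmasked version reads back an offset of 3, while the masked version still reads back 2. The size itself is still wrong in both cases, which matches the commit's point that the call remains invalid but no longer damages other fields.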
Binyang Li
163cba08c8
Update interface to let user change fifo size ( #243 )
...
Related to this issue:
https://github.com/microsoft/mscclpp/issues/242. The user may use more
threads than the number specified in `fifo_size` to interact with the
FIFO, which leads to unexpected behavior.
Update the interface to let users change the FIFO size as needed.
2024-01-09 22:14:36 -08:00
Binyang Li
e7d3e2d44b
Fix crash in static variable destructor ( #238 )
...
According to https://en.cppreference.com/w/cpp/utility/program/exit, `The last destructor for thread-local objects is [sequenced-before](https://en.cppreference.com/w/cpp/language/eval_order) the first destructor for a static object.`
Change the code to avoid this case.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com >
2023-12-25 14:01:28 +00:00
Changho Hwang
70e28b3c76
Do not check value of __HIP_PLATFORM_AMD__ ( #240 )
...
According to the
[documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_porting_guide.html#compiler-defines-summary),
`__HIP_PLATFORM_AMD__` takes effect simply by being defined, so its value should not be checked.
2023-12-25 13:51:18 +08:00
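The distinction behind #240 is between testing whether a macro is defined and testing its value. The sketch below uses a hypothetical macro, `DEMO_PLATFORM_FLAG`, defined with no value, to show why the definition test is the robust one; the last block applies the same pattern to the real macro.

```cpp
#include <cassert>

// Hypothetical stand-in for a flag that is meaningful only by being defined.
#define DEMO_PLATFORM_FLAG

// Testing for definition works whatever the macro expands to.
#if defined(DEMO_PLATFORM_FLAG)
constexpr bool kDemoFlagSet = true;
#else
constexpr bool kDemoFlagSet = false;
#endif

// By contrast, `#if DEMO_PLATFORM_FLAG == 1` would not compile here,
// because the macro expands to nothing.

// The same pattern applied to the real macro; this is false unless the
// compiler or build system defines __HIP_PLATFORM_AMD__.
#if defined(__HIP_PLATFORM_AMD__)
constexpr bool kIsAmdPlatform = true;
#else
constexpr bool kIsAmdPlatform = false;
#endif
```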
Changho Hwang
5fa5bd2706
Check nvidia_peermem during runtime ( #234 )
2023-12-25 12:02:10 +08:00
Changho Hwang
6202b10aa7
Fix #235 ( #239 )
...
Fix #235 breaking Python installation
2023-12-25 00:47:14 +08:00
Changho Hwang
413c9adb4d
Add optional prefix to installation paths ( #235 )
...
This allows another CMake project that includes mscclpp to change the
installation path.
2023-12-22 10:04:08 +00:00
Changho Hwang
f1605b73d6
v0.4.2 ( #236 )
v0.4.2
2023-12-18 11:42:58 +08:00
Changho Hwang
5ff8bc5ef2
Fix & improve perf for ROCm ( #232 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2023-12-18 11:30:08 +08:00
Changho Hwang
5a9998bfba
Include cstdint in packet_device.hpp ( #233 )
2023-12-07 09:44:35 -08:00
Changho Hwang
c15a166cf0
Add a documentation issue template ( #230 )
v0.4.1
2023-12-05 01:01:45 +00:00
Binyang Li
f1b2c9df12
Fix performance downgrade issue & update doc ( #229 )
...
For the push function, we only need to ensure that the `st.global`
instruction executes after the while loop. Because there is a write-after-read
hazard on `trigger.fst` (the code first checks `this->triggers[curFifoHead % size].fst
!= 0` and then writes to `triggers[curFifoHead % size]`), we can
expect the compiler and hardware to handle this ordering correctly.
Remove the `release.sys` qualifier there.
Note that `st.global.release.sys.v2.u64` causes a performance regression.
Previously we used `st.global.release.cta.v2.u64`, but it does not seem to be necessary.
2023-12-04 10:20:10 -08:00
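The check-then-write pattern that #229 describes can be sketched as a single-threaded host model. This is only an analogy: it models the slot logic (`fst == 0` meaning free, indexing by `curFifoHead % size`) and says nothing about GPU memory ordering or the PTX store qualifiers. `HostFifo`, `tryPush`, `tryPop`, `head`, and `tail` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Single-threaded host model for illustration only.
struct Trigger {
  uint64_t fst = 0;  // zero means the slot is free
  uint64_t snd = 0;
};

template <std::size_t Size>
struct HostFifo {
  std::array<Trigger, Size> triggers{};
  uint64_t head = 0;
  uint64_t tail = 0;

  // Non-blocking variant of push: the device code spins in a while loop
  // where this model returns false.
  bool tryPush(uint64_t fst, uint64_t snd) {
    assert(fst != 0);  // zero is reserved as the "free" marker
    uint64_t curFifoHead = head;
    Trigger& slot = triggers[curFifoHead % Size];
    if (slot.fst != 0) return false;  // read: slot is still occupied
    slot.snd = snd;                   // the write follows the read of the same slot
    slot.fst = fst;
    ++head;
    return true;
  }

  // Consumer side: take the oldest entry and mark its slot free.
  bool tryPop(Trigger& out) {
    Trigger& slot = triggers[tail % Size];
    if (slot.fst == 0) return false;
    out = slot;
    slot.fst = 0;
    ++tail;
    return true;
  }
};
```

In this model the write to a slot can only happen after the read that found it free, which is the dependency the commit message relies on.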
Changho Hwang
351b95b926
Update documents ( #225 )
...
Add AMD support to the docs
v0.4.0
2023-11-24 17:00:18 +08:00
Changho Hwang
544ff0c21d
ROCm support ( #213 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com >
2023-11-24 16:41:56 +08:00
Changho Hwang
dab19e00c1
Templatize Dockerfiles & update workflows ( #223 )
...
Images are now built by a script from a shared Dockerfile template
---------
Co-authored-by: Binyang Li <binyli@microsoft.com >
Co-authored-by: Saeed Maleki <saemal@microsoft.com >
2023-11-22 13:29:12 -08:00