Changho Hwang
1a7cb98e3a
v0.4.3 ( #279 )
v0.4.3
2024-03-27 11:53:09 -07:00
Changho Hwang
5ba6ce00c7
Fix bootstrapping mechanism ( #278 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>
2024-03-27 10:24:24 +08:00
Binyang Li
bc465aefcd
Add __launch_bounds__ for mscclpp-test ( #273 )
2024-03-25 15:55:37 -07:00
Binyang Li
4734d8718f
Fix multi-node ci pipeline ( #272 )
...
Add `__launch_bounds__` to fix perf regression issue in CI pipeline
2024-03-12 09:39:00 -07:00
Changho Hwang
cdaf3aea3d
New packet format & optimizations ( #256 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com>
2024-02-20 20:01:37 -08:00
Saeed Maleki
a3d0799963
Fix the comm.py for nvls ( #267 )
...
Fix the comm.py for nvls
2024-02-19 10:39:21 +08:00
Binyang Li
5971508eed
Remove cuda-python from project ( #245 )
...
Remove cuda-python and use CuPy APIs instead
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-13 21:44:11 +08:00
aashaka
d97fef4395
Allow semaphores and memory to be registered separately in ProxyService ( #264 )
...
This is needed in use cases where SimpleProxyChannel does not suffice.
For example, when a single semaphore is to be used for multiple tensors
or when multiple semaphores should be associated with a tensor.
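
The sharing patterns described above can be pictured with a toy registry. This is a minimal sketch and not the mscclpp API: `ToyProxyRegistry`, `add_semaphore`, `add_memory`, and `pair` are hypothetical names used only for illustration. It shows semaphores and memories receiving independent IDs, so that any semaphore can be combined with any memory.

```python
class ToyProxyRegistry:
    """Toy model (hypothetical, not mscclpp): independent ID spaces."""

    def __init__(self):
        self._semaphores = []
        self._memories = []

    def add_semaphore(self, semaphore):
        # Register a semaphore on its own and return its ID.
        self._semaphores.append(semaphore)
        return len(self._semaphores) - 1

    def add_memory(self, memory):
        # Register a memory region on its own and return its ID.
        self._memories.append(memory)
        return len(self._memories) - 1

    def pair(self, semaphore_id, memory_id):
        # Combine any registered semaphore with any registered memory.
        return self._semaphores[semaphore_id], self._memories[memory_id]


registry = ToyProxyRegistry()
sem = registry.add_semaphore("sem0")
tensor_a = registry.add_memory("tensor_a")
tensor_b = registry.add_memory("tensor_b")

# One semaphore shared by two tensors.
shared = [registry.pair(sem, tensor_a), registry.pair(sem, tensor_b)]
```

Because the two ID spaces are separate, the reverse case (several semaphores associated with one tensor) needs no extra bookkeeping in this model.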
2024-02-08 09:55:29 -08:00
Binyang Li
7c229fbdd8
Fix multi-nodes test failure ( #262 )
...
Fix the multi-node CI pipeline
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-07 18:21:05 -08:00
aashaka
2101f5251e
Allow MSCCL++ CommGroup to take PyTorch tensors in args ( #255 )
...
Obtain data_ptr and tensor_size accordingly for torch.Tensor
Co-authored-by: Binyang Li <binyli@microsoft.com>
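
A minimal sketch of how a pointer and size might be obtained from either kind of buffer. The helper name `buffer_ptr_and_size` is hypothetical, and treating the size as a byte count is an assumption; the commit does not show the actual code. `data_ptr()`, `numel()`, and `element_size()` are `torch.Tensor` methods, while `data.ptr` and `nbytes` are `cupy.ndarray` attributes.

```python
def buffer_ptr_and_size(buf):
    """Return (device pointer, size in bytes) for a tensor-like buffer."""
    if callable(getattr(buf, "data_ptr", None)):
        # torch.Tensor style: pointer via data_ptr(), size from element count.
        return buf.data_ptr(), buf.numel() * buf.element_size()
    data = getattr(buf, "data", None)
    if data is not None and hasattr(data, "ptr"):
        # cupy.ndarray style: pointer via data.ptr, size via nbytes.
        return data.ptr, buf.nbytes
    raise TypeError(f"unsupported buffer type: {type(buf).__name__}")
```

The function is duck-typed so that it does not import PyTorch or CuPy itself; callers pass whichever array type they already have.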
2024-02-06 19:47:25 -08:00
Changho Hwang
6a19b19ece
Fix NVLS support ( #258 )
...
* Do not compile nvls_test with ROCm
* Fix multi-node tests
2024-02-06 23:24:13 +00:00
Changho Hwang
d34e097b40
Fix wrong offset calculation ( #257 )
2024-02-06 08:55:43 +08:00
Saeed Maleki
91d592dcc0
NVLS support. ( #250 )
...
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2024-02-04 20:46:10 -08:00
Changho Hwang
4eb0a08b8c
Add putWithSignal() latency tests ( #246 )
2024-01-24 01:10:35 +00:00
aashaka
6c9d159e85
Increase MSCCLPP_BITS_REGMEM_HANDLE to 9 ( #251 )
...
MSCCLPP_BITS_REGMEM_HANDLE=8 limits the number of memories registered with a
ProxyService to 256. Many use cases, such as KV cache transfer, require
registering more tensors.
This change allows registering up to 512 memories. Note that it
uses up the slack bits remaining in the ChannelTrigger struct.
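
The 256 and 512 limits follow directly from the width of the handle field. A minimal sketch of that arithmetic, using hypothetical helper names (`handle_capacity`, `fits_in_handle`) that are not part of mscclpp:

```python
def handle_capacity(bits):
    """Number of distinct IDs representable in a field of `bits` bits."""
    if bits <= 0:
        raise ValueError("bits must be positive")
    return 1 << bits


def fits_in_handle(memory_id, bits):
    """True if `memory_id` can be encoded in a `bits`-wide field."""
    return 0 <= memory_id < handle_capacity(bits)


old_limit = handle_capacity(8)  # field width before this change
new_limit = handle_capacity(9)  # field width after this change
```

For instance, a memory ID of 300 does not fit in an 8-bit field but does fit in a 9-bit one.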
2024-01-23 13:38:33 -08:00
Binyang Li
422c81f0f8
remove make pylib-copy command ( #249 )
...
Fix #216
Remove `make pylib-copy`
2024-01-19 12:29:15 -08:00
Changho Hwang
1178a9ee47
Minor improvement on device syncer ( #231 )
...
Saves 1-2 us in some tests
2024-01-15 17:33:59 -08:00
Changho Hwang
c0fe31fa76
Mask each fields of the trigger ( #244 )
...
The behavior of `ProxyChannelDeviceHandle::put()` is undefined by design
when a field value exceeds its bit width (such as
`MSCCLPP_BITS_SIZE`). Even so, it is safer to mask off the excess bits of each
field value, so that invalid usage of one field does not
propagate to other fields.
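
A minimal sketch of why masking matters when several fields are packed into one word. The field widths and the helpers `pack_fields` and `unpack_fields` are hypothetical and do not reflect the actual trigger layout.

```python
def pack_fields(values, widths, mask=True):
    """Pack values into one integer, lowest field first."""
    word = 0
    shift = 0
    for value, width in zip(values, widths):
        if mask:
            # Trim bits that do not fit so they cannot spill into the next field.
            value &= (1 << width) - 1
        word |= value << shift
        shift += width
    return word


def unpack_fields(word, widths):
    """Split a packed integer back into its fields."""
    fields = []
    for width in widths:
        fields.append(word & ((1 << width) - 1))
        word >>= width
    return fields


widths = [4, 4]
values = [0x1F, 0x2]  # the first value needs 5 bits but the field has 4

unmasked = unpack_fields(pack_fields(values, widths, mask=False), widths)
masked = unpack_fields(pack_fields(values, widths, mask=True), widths)
```

Without masking, the overflowing bit of the first value is ORed into the second field and corrupts it; with masking, only the offending field is truncated.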
2024-01-10 19:31:47 -08:00
Binyang Li
163cba08c8
Update interface to let user change fifo size ( #243 )
...
Related to issue
https://github.com/microsoft/mscclpp/issues/242. The user may use more
threads than the number specified in `fifo_size` to interact with the
FIFO, which leads to unexpected behavior.
Update the interface to let users change the FIFO size as needed.
2024-01-09 22:14:36 -08:00
Binyang Li
e7d3e2d44b
Fix crash in static variable deconstructor ( #238 )
...
According to https://en.cppreference.com/w/cpp/utility/program/exit, `The last destructor for thread-local objects is [sequenced-before](https://en.cppreference.com/w/cpp/language/eval_order) the first destructor for a static object.`
Change the code to avoid this case.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-12-25 14:01:28 +00:00
Changho Hwang
70e28b3c76
Do not check value of __HIP_PLATFORM_AMD__ ( #240 )
...
According to the
[document](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_porting_guide.html#compiler-defines-summary),
`__HIP_PLATFORM_AMD__` takes effect simply by being defined, so its value should not be checked.
2023-12-25 13:51:18 +08:00
Changho Hwang
5fa5bd2706
Check nvidia_peermem during runtime ( #234 )
2023-12-25 12:02:10 +08:00
Changho Hwang
6202b10aa7
Fix #235 ( #239 )
...
Fix #235 breaking Python installation
2023-12-25 00:47:14 +08:00
Changho Hwang
413c9adb4d
Add optional prefix to installation paths ( #235 )
...
This allows another CMake project that includes mscclpp to change the
installation path.
2023-12-22 10:04:08 +00:00
Changho Hwang
f1605b73d6
v0.4.2 ( #236 )
v0.4.2
2023-12-18 11:42:58 +08:00
Changho Hwang
5ff8bc5ef2
Fix & improve perf for ROCm ( #232 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-12-18 11:30:08 +08:00
Changho Hwang
5a9998bfba
Include cstdint in packet_device.hpp ( #233 )
2023-12-07 09:44:35 -08:00
Changho Hwang
c15a166cf0
Add a documentation issue template ( #230 )
v0.4.1
2023-12-05 01:01:45 +00:00
Binyang Li
f1b2c9df12
Fix performance downgrade issue & update doc ( #229 )
...
For the push function, we only need to make sure that the `st.global` instruction
is executed after the while loop. Since there is a Write-After-Read
hazard on `trigger.fst` (we first check `this->triggers[curFifoHead % size].fst
!= 0`, then write the value to `triggers[curFifoHead % size]`), we can
expect the compiler and hardware to handle this situation correctly.
Remove the `release.sys` there.
In addition, `st.global.release.sys.v2.u64` causes a performance regression.
Previously we used `st.global.release.cta.v2.u64`, but it does not seem necessary.
2023-12-04 10:20:10 -08:00
Changho Hwang
351b95b926
Update documents ( #225 )
...
Add AMD support to the docs
v0.4.0
2023-11-24 17:00:18 +08:00
Changho Hwang
544ff0c21d
ROCm support ( #213 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com>
2023-11-24 16:41:56 +08:00
Changho Hwang
dab19e00c1
Templatize Dockerfiles & update workflows ( #223 )
...
Images are now built by a script from a shared Dockerfile template
---------
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-11-22 13:29:12 -08:00
Changho Hwang
15f6dcca49
Update documentation ( #217 )
...
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
2023-11-22 12:58:04 -08:00
Changho Hwang
7bd66a938c
Robust correctness test ( #221 )
...
Co-authored-by: Aashaka Shah <aashaka96@gmail.com>
2023-11-22 12:06:50 +08:00
Changho Hwang
3431f37067
Fix DeviceSyncer ( #222 )
2023-11-20 17:15:18 -08:00
Saeed Maleki
70eb6d7328
Fixing the bug in allreduce1 ( #220 )
2023-11-18 10:34:52 -08:00
Saeed Maleki
1d1199703a
Auto-tune single-node AllReduce ( #219 )
...
single node auto-tuner + graph plotter + bug fix for illegal memory access
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-11-17 21:42:05 +08:00
Changho Hwang
060fda12e6
mscclpp-test in Python ( #204 )
...
Co-authored-by: Binyang Li <binyli@microsoft.com>
Co-authored-by: Saeed Maleki <saemal@microsoft.com>
Co-authored-by: Esha Choukse <eschouks@microsoft.com>
2023-11-16 12:45:25 +08:00
Changho Hwang
e710701728
Warning ahead of CQ being full ( #202 )
2023-11-15 08:03:29 +00:00
Changho Hwang
4cdb100265
Release GIL for Python APIs with wait ( #190 )
2023-11-14 21:11:01 +08:00
Changho Hwang
3521fb0280
Clear minor warnings ( #214 )
...
Clear warnings from the clang compiler.
2023-11-14 09:28:48 +08:00
Binyang2014
0863e862f5
minor fix ( #211 )
...
minor fix for pipeline
2023-11-03 15:52:42 +08:00
Binyang2014
8a938de9c5
fix pipeline ( #209 )
...
fix pipeline for multi-node test
2023-11-03 05:18:32 +00:00
Binyang2014
db0528ca7f
minor fix ( #203 )
...
minor fix: make sure to load/store int4 in the correct byte order
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-10-27 15:19:56 +08:00
Changho Hwang
f68820436c
Explicit build dependency on nvidia_peermem ( #201 )
2023-10-23 04:29:30 +00:00
Changho Hwang
7686e15fbd
Allow infinite waiting ( #200 )
2023-10-23 12:28:05 +08:00
Binyang2014
6f43282c1d
Fix allreduce bug ( #197 )
...
Fix allreduce correctness issue
2023-10-18 23:16:57 +08:00
Saeed Maleki
85e8017535
Atomic for semaphores instead of fences ( #188 )
...
Co-authored-by: Pratyush Patel <pratyushpatel.1995@gmail.com>
Co-authored-by: Esha Choukse <eschouks@microsoft.com>
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-10-13 18:57:08 +08:00
Saeed Maleki
c4785c9591
Improve debugging messages ( #195 )
...
Debugging information to understand what connections are being made.
---------
Co-authored-by: Changho Hwang <changhohwang@microsoft.com>
2023-10-13 16:55:52 +08:00
Saeed Maleki
148681b4bc
Fix a pytest bug ( #196 )
2023-10-13 16:39:43 +08:00