mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-26 08:01:00 +00:00

Author	SHA1	Message	Date
Changho Hwang	1a7cb98e3a	v0.4.3 (#279 ) v0.4.3	2024-03-27 11:53:09 -07:00
Changho Hwang	5ba6ce00c7	Fix bootstrapping mechanism (#278 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>	2024-03-27 10:24:24 +08:00
Binyang Li	bc465aefcd	Add __launch_bounds__ for mscclpp-test (#273 )	2024-03-25 15:55:37 -07:00
Binyang Li	4734d8718f	Fix multi-node ci pipeline (#272 ) Add `__launch_bounds__` to fix perf regression issue in CI pipeline	2024-03-12 09:39:00 -07:00
Changho Hwang	cdaf3aea3d	New packet format & optimizations (#256 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-02-20 20:01:37 -08:00
Saeed Maleki	a3d0799963	Fix the comm.py for nvls (#267 ) Fix the comm.py for nvls	2024-02-19 10:39:21 +08:00
Binyang Li	5971508eed	Remove cuda-python from project (#245 ) Remove cuda-python and use CuPy APIs instead --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-13 21:44:11 +08:00
aashaka	d97fef4395	Allow semaphores and memory to be registered separately in ProxyService (#264 ) This is needed in use cases where SimpleProxyChannel does not suffice. For example, when a single semaphore is to be used for multiple tensors or when multiple semaphores should be associated with a tensor.	2024-02-08 09:55:29 -08:00
Binyang Li	7c229fbdd8	Fix multi-nodes test failure (#262 ) fix multi-nodes CI pipeline Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-07 18:21:05 -08:00
aashaka	2101f5251e	Allow MSCCL++ CommGroup to take PyTorch tensors in args (#255 ) Obtain data_ptr and tensor_size accordingly for torch.Tensor Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-02-06 19:47:25 -08:00
Changho Hwang	6a19b19ece	Fix NVLS support (#258 ) * Do not compile nvls_test with ROCm * Fix multi-node tests	2024-02-06 23:24:13 +00:00
Changho Hwang	d34e097b40	Fix wrong offset calculation (#257 )	2024-02-06 08:55:43 +08:00
Saeed Maleki	91d592dcc0	NVLS support. (#250 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-04 20:46:10 -08:00
Changho Hwang	4eb0a08b8c	Add `putWithSignal()` latency tests (#246 )	2024-01-24 01:10:35 +00:00
aashaka	6c9d159e85	Increase MSCCLPP_BITS_REGMEM_HANDLE to 9 (#251 ) MSCCLPP_BITS_REGMEM_HANDLE=8 limits the number of registered memories for a ProxyService to 256. Many use cases, such as KV cache transfer, require registering more tensors. This change allows registering up to 512 memories. Note that this change uses up the slack bits remaining in the ChannelTrigger struct.	2024-01-23 13:38:33 -08:00
Binyang Li	422c81f0f8	remove make pylib-copy command (#249 ) Fix #216 Remove `make pylib-copy`	2024-01-19 12:29:15 -08:00
Changho Hwang	1178a9ee47	Minor improvement on device syncer (#231 ) Saved 1~2us in some tests	2024-01-15 17:33:59 -08:00
Changho Hwang	c0fe31fa76	Mask each fields of the trigger (#244 ) The behavior of `ProxyChannelDeviceHandle::put()` is undefined by design when a field value exceeds its bit limit (such as `MSCCLPP_BITS_SIZE`). Even so, it is safer to trim the excess bits of each field value, so that invalid usage of one field does not propagate to other fields.	2024-01-10 19:31:47 -08:00
Binyang Li	163cba08c8	Update interface to let user change fifo size (#243 ) Related to this issue: https://github.com/microsoft/mscclpp/issues/242. The user may use more threads than the number specified in `fifo_size` to interact with the FIFO, which leads to unexpected behavior. Update the interface to let users change the FIFO size as needed.	2024-01-09 22:14:36 -08:00
Binyang Li	e7d3e2d44b	Fix crash in static variable deconstructor (#238 ) According to https://en.cppreference.com/w/cpp/utility/program/exit, `The last destructor for thread-local objects is [sequenced-before](https://en.cppreference.com/w/cpp/language/eval_order) the first destructor for a static object.` Change the code to avoid this case. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-12-25 14:01:28 +00:00
Changho Hwang	70e28b3c76	Do not check value of `__HIP_PLATFORM_AMD__` (#240 ) According to the [document](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_porting_guide.html#compiler-defines-summary), `__HIP_PLATFORM_AMD__` is effective only by definition.	2023-12-25 13:51:18 +08:00
Changho Hwang	5fa5bd2706	Check `nvidia_peermem` during runtime (#234 )	2023-12-25 12:02:10 +08:00
Changho Hwang	6202b10aa7	Fix #235 (#239 ) Fix #235 breaking Python installation	2023-12-25 00:47:14 +08:00
Changho Hwang	413c9adb4d	Add optional prefix to installation paths (#235 ) This allows another CMake project that includes mscclpp to change the installation path.	2023-12-22 10:04:08 +00:00
Changho Hwang	f1605b73d6	v0.4.2 (#236 ) v0.4.2	2023-12-18 11:42:58 +08:00
Changho Hwang	5ff8bc5ef2	Fix & improve perf for ROCm (#232 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-12-18 11:30:08 +08:00
Changho Hwang	5a9998bfba	Include `cstdint` in packet_device.hpp (#233 )	2023-12-07 09:44:35 -08:00
Changho Hwang	c15a166cf0	Add a documentation issue template (#230 ) v0.4.1	2023-12-05 01:01:45 +00:00
Binyang Li	f1b2c9df12	Fix performance downgrade issue & update doc (#229 ) For the push function, we only need to make sure the `st.global` instruction is executed after the while loop. Since there is a Write-After-Read hazard on `trigger.fst` (we first check `this->triggers[curFifoHead % size].fst != 0`, then write the value to `triggers[curFifoHead % size]`), we can expect the compiler and hardware to handle this situation correctly, so the `release.sys` there is removed. Note that `st.global.release.sys.v2.u64` causes a perf regression. Previously we used `st.global.release.cta.v2.u64`, but it seems unnecessary.	2023-12-04 10:20:10 -08:00
Changho Hwang	351b95b926	Update documents (#225 ) Add AMD support to the docs v0.4.0	2023-11-24 17:00:18 +08:00
Changho Hwang	544ff0c21d	ROCm support (#213 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-11-24 16:41:56 +08:00
Changho Hwang	dab19e00c1	Templatize Dockerfiles & update workflows (#223 ) Now build images by a script with a shared Dockerfile template --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 13:29:12 -08:00
Changho Hwang	15f6dcca49	Update documentation (#217 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 12:58:04 -08:00
Changho Hwang	7bd66a938c	Robust correctness test (#221 ) Co-authored-by: Aashaka Shah <aashaka96@gmail.com>	2023-11-22 12:06:50 +08:00
Changho Hwang	3431f37067	Fix DeviceSyncer (#222 )	2023-11-20 17:15:18 -08:00
Saeed Maleki	70eb6d7328	Fixing the bug in allreduce1 (#220 )	2023-11-18 10:34:52 -08:00
Saeed Maleki	1d1199703a	Auto-tune single-node AllReduce (#219 ) single node auto-tuner + graph plotter + bug fix for illegal memory access --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-11-17 21:42:05 +08:00
Changho Hwang	060fda12e6	mscclpp-test in Python (#204 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Esha Choukse <eschouks@microsoft.com>	2023-11-16 12:45:25 +08:00
Changho Hwang	e710701728	Warning ahead of CQ being full (#202 )	2023-11-15 08:03:29 +00:00
Changho Hwang	4cdb100265	Release GIL for Python APIs with wait (#190 )	2023-11-14 21:11:01 +08:00
Changho Hwang	3521fb0280	Clear minor warnings (#214 ) Clear warnings from the clang compiler.	2023-11-14 09:28:48 +08:00
Binyang2014	0863e862f5	minor fix (#211 ) minor fix for pipeline	2023-11-03 15:52:42 +08:00
Binyang2014	8a938de9c5	fix pipeline (#209 ) fix pipeline for multi-node test	2023-11-03 05:18:32 +00:00
Binyang2014	db0528ca7f	minor fix (#203 ) minor fix: make sure load/store int4 in the correct bytes order Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-10-27 15:19:56 +08:00
Changho Hwang	f68820436c	Explicit build dependency on `nvidia_peermem` (#201 )	2023-10-23 04:29:30 +00:00
Changho Hwang	7686e15fbd	Allow infinite waiting (#200 )	2023-10-23 12:28:05 +08:00
Binyang2014	6f43282c1d	Fix allreduce bug (#197 ) Fix allreduce correctness issue	2023-10-18 23:16:57 +08:00
Saeed Maleki	85e8017535	Atomic for semaphores instead of fences (#188 ) Co-authored-by: Pratyush Patel <pratyushpatel.1995@gmail.com> Co-authored-by: Esha Choukse <eschouks@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-10-13 18:57:08 +08:00
Saeed Maleki	c4785c9591	Improve debugging messages (#195 ) Debugging information to understand what connections are being made. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-10-13 16:55:52 +08:00
Saeed Maleki	148681b4bc	Fix a pytest bug (#196 )	2023-10-13 16:39:43 +08:00


600 Commits