mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-11 17:00:22 +00:00

Author	SHA1	Message	Date
Changho Hwang	1f62dfd7cd	Add C++ executor test (#304 ) - Add C++ executor test - Fix executor bugs for packet operation - Enhance executor_test.py --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-05-29 10:54:36 +00:00
Changho Hwang	cddffbc8b6	v0.5.1 (#308 ) v0.5.1	2024-05-26 14:31:29 -07:00
Binyang Li	3a18068cd4	Fix security issue (#305 ) Change sprintf to snprintf to avoid potential security issue	2024-05-25 23:12:57 -07:00
Changho Hwang	f76eae4dca	Fix assert declaration & add a compile test (#303 )	2024-05-20 02:39:30 +00:00
Changho Hwang	d35a2f2dc2	Rename executor.cpp to executor_py.cpp (#301 )	2024-05-17 13:31:27 -07:00
Changho Hwang	a3cd95bd42	Upgrade gtest (#300 ) The new gtest version resolves a type casting issue: `3044657e7a`	2024-05-07 20:49:26 -07:00
Changho Hwang	9c2a96060a	v0.5.0 (#298 ) v0.5.0	2024-05-04 16:51:48 -07:00
aashaka	0650371b54	Allow obtaining cuda stream handle from PyTorch stream when launching kernel (#297 ) Use `cuda_stream` attribute of a torch stream if the stream is not an instance of the cupy stream.	2024-05-04 04:57:07 +00:00
Binyang Li	6226556ce2	Optimized the execution kernel (#294 )	2024-05-03 11:54:50 -07:00
Binyang Li	fc977ce5dd	Move pipeline to Azure org (#296 ) Move multi-nodes pipeline to Azure org to meet the compliance requirements. Remove default value for BASE_IMAGE. Not allowed to use 3rd party registry in Dockerfile directly.	2024-04-29 11:54:34 +08:00
Binyang Li	5628362715	Resolve multi-nodes test failure issue (#295 ) Fix bug, resolve multi-nodes test failure issue.	2024-04-26 13:06:57 +08:00
Changho Hwang	d4ede480f4	Ethernet support (#284 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-04-25 11:06:43 -07:00
Changho Hwang	89896ff94f	Include GPU data types only for kernel code (#292 )	2024-04-24 20:55:02 -07:00
Changho Hwang	6c1fa5307c	Refactoring NVLS interfaces (#293 ) Move NVLS details from the core to a separate interface	2024-04-24 10:05:41 -07:00
Changho Hwang	9934c982a8	Seperate headers for GPU data types (#291 ) Prevent unnecessarily including data type headers in everywhere.	2024-04-19 05:52:43 +00:00
Roshan Dathathri	41e0964d93	Allow binding allocated memory to NVLS multicast pointer (#290 ) And change NVLS multimem instructions to static functions	2024-04-18 17:11:31 -07:00
Binyang Li	64d837f9ab	Add executor to execute schedule-plan file (#283 ) Add executor to execute the JSON schedule file generated by msccl-tools --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-04-18 19:10:41 +00:00
Changho Hwang	9406123711	Fix a typo name (#286 )	2024-04-17 23:45:46 +00:00
Changho Hwang	1a7cb98e3a	v0.4.3 (#279 ) v0.4.3	2024-03-27 11:53:09 -07:00
Changho Hwang	5ba6ce00c7	Fix bootstrapping mechanism (#278 ) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>	2024-03-27 10:24:24 +08:00
Binyang Li	bc465aefcd	Add __launch_bounds__ for mscclpp-test (#273 )	2024-03-25 15:55:37 -07:00
Binyang Li	4734d8718f	Fix multi-node ci pipeline (#272 ) Add `__launch_bounds__` to fix perf regression issue in CI pipeline	2024-03-12 09:39:00 -07:00
Changho Hwang	cdaf3aea3d	New packet format & optimizations (#256 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-02-20 20:01:37 -08:00
Saeed Maleki	a3d0799963	Fix the comm.py for nvls (#267 ) Fix the comm.py for nvls	2024-02-19 10:39:21 +08:00
Binyang Li	5971508eed	Remove cuda-python from project (#245 ) Remove cuda-python and use CuPy APIs instead --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-13 21:44:11 +08:00
aashaka	d97fef4395	Allow semaphores and memory to be registered separately in ProxyService (#264 ) This is needed in use cases where SimpleProxyChannel does not suffice. For example, when a single semaphore is to be used for multiple tensors or when multiple semaphores should be associated with a tensor.	2024-02-08 09:55:29 -08:00
Binyang Li	7c229fbdd8	Fix multi-nodes test failure (#262 ) fix multi-nodes CI pipeline Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-07 18:21:05 -08:00
aashaka	2101f5251e	Allow MSCCL++ CommGroup to take PyTorch tensors in args (#255 ) Obtain data_ptr and tensor_size accordingly for torch.Tensor Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-02-06 19:47:25 -08:00
Changho Hwang	6a19b19ece	Fix NVLS support (#258 ) * Do not compile nvls_test with ROCm * Fix multi-node tests	2024-02-06 23:24:13 +00:00
Changho Hwang	d34e097b40	Fix wrong offset calculation (#257 )	2024-02-06 08:55:43 +08:00
Saeed Maleki	91d592dcc0	NVLS support. (#250 ) Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-04 20:46:10 -08:00
Changho Hwang	4eb0a08b8c	Add `putWithSignal()` latency tests (#246 )	2024-01-24 01:10:35 +00:00
aashaka	6c9d159e85	Increase MSCCLPP_BITS_REGMEM_HANDLE to 9 (#251 ) MSCCLPP_BITS_REGMEM_HANDLE=8 limits the number registered memories for a ProxyService to 256. Many use cases, such as KV cache transfer, require registering more tensors. This change allows registering up-to 512 memories. Note that this change uses up the slack bits remaining in the ChannelTrigger struct.	2024-01-23 13:38:33 -08:00
Binyang Li	422c81f0f8	remove make pylib-copy command (#249 ) Fix #216 Remove `make pylib-copy`	2024-01-19 12:29:15 -08:00
Changho Hwang	1178a9ee47	Minor improvement on device syncer (#231 ) Saved 1~2us in some tests	2024-01-15 17:33:59 -08:00
Changho Hwang	c0fe31fa76	Mask each fields of the trigger (#244 ) The behavior of `ProxyChannelDeviceHandle::put()` is undefined by design when each field value is given to exceed the bits limitation (such as `MSCCLPP_BITS_SIZE`). Even so, we'd better trim exceeding bits of each field value for safety, so that the invalid usage of a field does not propagate to other fields.	2024-01-10 19:31:47 -08:00
Binyang Li	163cba08c8	Update interface to let user change fifo size (#243 ) Related with this issue: https://github.com/microsoft/mscclpp/issues/242. The user may use more threads than the number specified in `fifo_size` to interact with the FIFO. In this case, there will be unexpected behavior. Update the interface to let user change fifo size on their demands.	2024-01-09 22:14:36 -08:00
Binyang Li	e7d3e2d44b	Fix crash in static variable deconstructor (#238 ) According to https://en.cppreference.com/w/cpp/utility/program/exit, `The last destructor for thread-local objects is [sequenced-before](https://en.cppreference.com/w/cpp/language/eval_order) the first destructor for a static object.` Change the code to avoid this case. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-12-25 14:01:28 +00:00
Changho Hwang	70e28b3c76	Do not check value of `__HIP_PLATFORM_AMD__` (#240 ) According to the [document](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_porting_guide.html#compiler-defines-summary), `__HIP_PLATFORM_AMD__` is effective only by definition.	2023-12-25 13:51:18 +08:00
Changho Hwang	5fa5bd2706	Check `nvidia_peermem` during runtime (#234 )	2023-12-25 12:02:10 +08:00
Changho Hwang	6202b10aa7	Fix #235 (#239 ) Fix #235 breaking Python installation	2023-12-25 00:47:14 +08:00
Changho Hwang	413c9adb4d	Add optional prefix to installation paths (#235 ) This allows another CMake project that includes mscclpp to change the installation path.	2023-12-22 10:04:08 +00:00
Changho Hwang	f1605b73d6	v0.4.2 (#236 ) v0.4.2	2023-12-18 11:42:58 +08:00
Changho Hwang	5ff8bc5ef2	Fix & improve perf for ROCm (#232 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-12-18 11:30:08 +08:00
Changho Hwang	5a9998bfba	Include `cstdint` in packet_device.hpp (#233 )	2023-12-07 09:44:35 -08:00
Changho Hwang	c15a166cf0	Add a documentation issue template (#230 ) v0.4.1	2023-12-05 01:01:45 +00:00
Binyang Li	f1b2c9df12	Fix performance downgrade issue & update doc (#229 ) For push function, we only need to make sure the instruction `st.global` will be executed after the while loop. Since there is a Write-After-Read hazard for `trigger.fst` (Check `this->triggers[curFifoHead % size].fst != 0` first then write value to `triggers[curFifoHead % size]`), we can expect the compiler and hardware can handle this situation correctly. Remove the `release.sys` there. BTW, `st.global.release.sys.v2.u64` will cause perf regression issue. Previous we use `st.global.release.cta.v2.u64`, but seems not necessary.	2023-12-04 10:20:10 -08:00
Changho Hwang	351b95b926	Update documents (#225 ) Adding AMD supports on the docs v0.4.0	2023-11-24 17:00:18 +08:00
Changho Hwang	544ff0c21d	ROCm support (#213 ) Co-authored-by: Binyang Li <binyli@microsoft.com>	2023-11-24 16:41:56 +08:00
Changho Hwang	dab19e00c1	Templatize Dockerfiles & update workflows (#223 ) Now build images by a script with a shared Dockerfile template --------- Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Saeed Maleki <saemal@microsoft.com>	2023-11-22 13:29:12 -08:00
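
Two of the commits above (6c9d159e85 and c0fe31fa76) concern bit-limited fields in the trigger struct. The sketch below is only an illustration of the arithmetic behind them: widening a handle field from 8 to 9 bits doubles the addressable entries from 256 to 512, and masking each field before packing keeps an oversized value from spilling into its neighbour. The two-field layout, the 32-bit size width, and all function names are hypothetical and are not taken from the mscclpp sources.

```python
# Illustrative sketch only: this two-field layout is hypothetical and is NOT
# the real ChannelTrigger definition from mscclpp.
BITS_SIZE = 32           # assumed width of the size field, for illustration
BITS_REGMEM_HANDLE = 9   # width named in commit 6c9d159e85 (previously 8)


def max_entries(bits: int) -> int:
    """Number of distinct values an unsigned field of `bits` bits can hold."""
    return 1 << bits


def mask_field(value: int, bits: int) -> int:
    """Keep only the low `bits` bits of `value`."""
    return value & ((1 << bits) - 1)


def pack(size: int, handle: int, *, masked: bool) -> int:
    """Pack two fields into one word: size in the low bits, handle above it."""
    if masked:
        size = mask_field(size, BITS_SIZE)
        handle = mask_field(handle, BITS_REGMEM_HANDLE)
    return size | (handle << BITS_SIZE)


def unpack_handle(word: int) -> int:
    """Read the handle field back out of a packed word."""
    return (word >> BITS_SIZE) & ((1 << BITS_REGMEM_HANDLE) - 1)


# A size value one bit wider than the size field allows.
oversized = (1 << BITS_SIZE) | 5

print(max_entries(8), max_entries(9))                   # 256 512
print(unpack_handle(pack(oversized, 2, masked=False)))  # 3: stray bit corrupts the handle
print(unpack_handle(pack(oversized, 2, masked=True)))   # 2: handle is preserved
```

Without masking, the extra bit of the oversized size lands in the lowest bit of the handle field, so handle 2 reads back as 3. With masking, the invalid size is truncated and the handle is unaffected, which is the kind of containment the message of c0fe31fa76 describes.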

618 Commits