mscclpp

mirror of https://github.com/microsoft/mscclpp.git synced 2026-05-12 09:17:06 +00:00

Author	SHA1	Message	Date
caiomcbr	f4c3c8f916	AllReduce Kernel for Small Messages (#322) Adding allreduce kernel code for message sizes smaller than 32 bytes, when the number of elements is smaller than the number of ranks. --------- Co-authored-by: Caio Rocha <caio.rocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-07-05 21:08:43 +00:00
Ziyue Yang	b5a48f836c	Separate NPKit CPU timestamp access from different blocks for AMD platform (#321) Reference: https://github.com/ROCm/rccl/pull/1229	2024-07-02 19:36:48 +08:00
Angelica Moreira	0f796bbdf7	Update allreduce_bench.py (#318) Replacing the hardcoded network interface name with a generic discovery strategy. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-06-29 03:41:13 +00:00
caiomcbr	b1b9d0626c	Support NCCL APIs (#319) Start supporting NCCL APIs with a few limitations. --------- Co-authored-by: Caio Rocha <caio.rocha@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-06-27 23:54:06 +00:00
Roshan Dathathri	91550dab4c	Simplify/improve barrier in AllReduce6 (#317) Drop superfluous __threadfence_system()	2024-06-23 21:08:59 +00:00
Angelica Moreira	34f4d9d006	Update quickstart.md (#314) Updating the docker image name tag and the python benchmark path.	2024-06-19 22:26:13 +00:00
Roshan Dathathri	93ed8e1e58	Add support for multicast reduce instruction (#316)	2024-06-19 13:28:12 -07:00
Binyang Li	1351f9f1c5	Add "packet type" option for executor test (#313) Add "packet type" option for executor test	2024-06-14 09:53:58 +00:00
Ziyue Yang	f29095b3b1	Fix NPKit support for AMD (#312)	2024-06-14 16:22:14 +08:00
Ziyue Yang	76328fe623	Add NPKit GPU event support (#310)	2024-06-13 13:59:50 +08:00
Binyang Li	80aefe55bc	Cumulative Updates (#309) Bug fix: Unable to execute communication primitives with the same execution plan but varying message sizes. Add reduce_packets OP	2024-06-12 19:17:57 +08:00
Changho Hwang	1f62dfd7cd	Add C++ executor test (#304) - Add C++ executor test - Fix executor bugs for packet operation - Enhance executor_test.py --------- Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-05-29 10:54:36 +00:00
Changho Hwang	cddffbc8b6	v0.5.1 (#308) v0.5.1	2024-05-26 14:31:29 -07:00
Binyang Li	3a18068cd4	Fix security issue (#305) Change sprintf to snprintf to avoid potential security issue	2024-05-25 23:12:57 -07:00
Changho Hwang	f76eae4dca	Fix assert declaration & add a compile test (#303)	2024-05-20 02:39:30 +00:00
Changho Hwang	d35a2f2dc2	Rename executor.cpp to executor_py.cpp (#301)	2024-05-17 13:31:27 -07:00
Changho Hwang	a3cd95bd42	Upgrade gtest (#300) The new gtest version resolves a type casting issue: `3044657e7a`	2024-05-07 20:49:26 -07:00
Changho Hwang	9c2a96060a	v0.5.0 (#298) v0.5.0	2024-05-04 16:51:48 -07:00
aashaka	0650371b54	Allow obtaining cuda stream handle from PyTorch stream when launching kernel (#297) Use `cuda_stream` attribute of a torch stream if the stream is not an instance of the cupy stream.	2024-05-04 04:57:07 +00:00
Binyang Li	6226556ce2	Optimized the execution kernel (#294)	2024-05-03 11:54:50 -07:00
Binyang Li	fc977ce5dd	Move pipeline to Azure org (#296) Move multi-nodes pipeline to Azure org to meet the compliance requirements. Remove default value for BASE_IMAGE. Not allowed to use 3rd party registry in Dockerfile directly.	2024-04-29 11:54:34 +08:00
Binyang Li	5628362715	Resolve multi-nodes test failure issue (#295) Fix bug, resolve multi-nodes test failure issue.	2024-04-26 13:06:57 +08:00
Changho Hwang	d4ede480f4	Ethernet support (#284) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Caio Rocha <caiorocha@microsoft.com>	2024-04-25 11:06:43 -07:00
Changho Hwang	89896ff94f	Include GPU data types only for kernel code (#292)	2024-04-24 20:55:02 -07:00
Changho Hwang	6c1fa5307c	Refactoring NVLS interfaces (#293) Move NVLS details from the core to a separate interface	2024-04-24 10:05:41 -07:00
Changho Hwang	9934c982a8	Separate headers for GPU data types (#291) Prevent unnecessarily including data type headers everywhere.	2024-04-19 05:52:43 +00:00
Roshan Dathathri	41e0964d93	Allow binding allocated memory to NVLS multicast pointer (#290) And change NVLS multimem instructions to static functions	2024-04-18 17:11:31 -07:00
Binyang Li	64d837f9ab	Add executor to execute schedule-plan file (#283) Add executor to execute the JSON schedule file generated by msccl-tools --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-04-18 19:10:41 +00:00
Changho Hwang	9406123711	Fix a typo name (#286)	2024-04-17 23:45:46 +00:00
Changho Hwang	1a7cb98e3a	v0.4.3 (#279) v0.4.3	2024-03-27 11:53:09 -07:00
Changho Hwang	5ba6ce00c7	Fix bootstrapping mechanism (#278) Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Pashupati Kumar <74680231+pash-msft@users.noreply.github.com>	2024-03-27 10:24:24 +08:00
Binyang Li	bc465aefcd	Add __launch_bounds__ for mscclpp-test (#273)	2024-03-25 15:55:37 -07:00
Binyang Li	4734d8718f	Fix multi-node ci pipeline (#272) Add `__launch_bounds__` to fix perf regression issue in CI pipeline	2024-03-12 09:39:00 -07:00
Changho Hwang	cdaf3aea3d	New packet format & optimizations (#256) Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-02-20 20:01:37 -08:00
Saeed Maleki	a3d0799963	Fix the comm.py for nvls (#267) Fix the comm.py for nvls	2024-02-19 10:39:21 +08:00
Binyang Li	5971508eed	Remove cuda-python from project (#245) Remove cuda-python and use CuPy APIs instead --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-13 21:44:11 +08:00
aashaka	d97fef4395	Allow semaphores and memory to be registered separately in ProxyService (#264) This is needed in use cases where SimpleProxyChannel does not suffice. For example, when a single semaphore is to be used for multiple tensors or when multiple semaphores should be associated with a tensor.	2024-02-08 09:55:29 -08:00
Binyang Li	7c229fbdd8	Fix multi-nodes test failure (#262) fix multi-nodes CI pipeline Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-07 18:21:05 -08:00
aashaka	2101f5251e	Allow MSCCL++ CommGroup to take PyTorch tensors in args (#255) Obtain data_ptr and tensor_size accordingly for torch.Tensor Co-authored-by: Binyang Li <binyli@microsoft.com>	2024-02-06 19:47:25 -08:00
Changho Hwang	6a19b19ece	Fix NVLS support (#258) * Do not compile nvls_test with ROCm * Fix multi-node tests	2024-02-06 23:24:13 +00:00
Changho Hwang	d34e097b40	Fix wrong offset calculation (#257)	2024-02-06 08:55:43 +08:00
Saeed Maleki	91d592dcc0	NVLS support. (#250) Co-authored-by: Saeed Maleki <saemal@microsoft.com> Co-authored-by: Binyang Li <binyli@microsoft.com> Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2024-02-04 20:46:10 -08:00
Changho Hwang	4eb0a08b8c	Add `putWithSignal()` latency tests (#246)	2024-01-24 01:10:35 +00:00
aashaka	6c9d159e85	Increase MSCCLPP_BITS_REGMEM_HANDLE to 9 (#251) MSCCLPP_BITS_REGMEM_HANDLE=8 limits the number of registered memories for a ProxyService to 256. Many use cases, such as KV cache transfer, require registering more tensors. This change allows registering up to 512 memories. Note that this change uses up the slack bits remaining in the ChannelTrigger struct.	2024-01-23 13:38:33 -08:00
Binyang Li	422c81f0f8	remove make pylib-copy command (#249) Fix #216 Remove `make pylib-copy`	2024-01-19 12:29:15 -08:00
Changho Hwang	1178a9ee47	Minor improvement on device syncer (#231) Saved 1~2us in some tests	2024-01-15 17:33:59 -08:00
Changho Hwang	c0fe31fa76	Mask each field of the trigger (#244) The behavior of `ProxyChannelDeviceHandle::put()` is undefined by design when a field value exceeds its bit limit (such as `MSCCLPP_BITS_SIZE`). Even so, we'd better trim the excess bits of each field value for safety, so that the invalid usage of a field does not propagate to other fields.	2024-01-10 19:31:47 -08:00
Binyang Li	163cba08c8	Update interface to let user change fifo size (#243) Related to this issue: https://github.com/microsoft/mscclpp/issues/242. The user may use more threads than the number specified in `fifo_size` to interact with the FIFO. In this case, there will be unexpected behavior. Update the interface to let the user change the fifo size as needed.	2024-01-09 22:14:36 -08:00
Binyang Li	e7d3e2d44b	Fix crash in static variable destructor (#238) According to https://en.cppreference.com/w/cpp/utility/program/exit, `The last destructor for thread-local objects is [sequenced-before](https://en.cppreference.com/w/cpp/language/eval_order) the first destructor for a static object.` Change the code to avoid this case. --------- Co-authored-by: Changho Hwang <changhohwang@microsoft.com>	2023-12-25 14:01:28 +00:00
Changho Hwang	70e28b3c76	Do not check value of `__HIP_PLATFORM_AMD__` (#240) According to the [document](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_porting_guide.html#compiler-defines-summary), `__HIP_PLATFORM_AMD__` is effective only by definition.	2023-12-25 13:51:18 +08:00

629 Commits