ik_llama.cpp/ggml/src/ggml-cuda
Kawrakow 1d7d0225a0 Graph parallel: the next generation (#1080)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs,
(0,1), (2,3), (0,2), (1,3), and use those, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution: I cannot be creating pairwise
communicators and the associated logic for every possible number of GPUs.

* WIP: Cohere2

* Explicitly set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Implement the reduce op without NCCL available

* Be able to build without NCCL

cmake -DGGML_NCCL=OFF disables it

* Make --max-gpu work again

* Slightly better for 4 GPUs without NCCL

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 08:31:48 +01:00
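
The four pairwise communicator groups mentioned in the commit message, (0,1), (2,3), (0,2), (1,3), match the pairing pattern of a two-stage butterfly all-reduce. The commit message does not say how the groups are actually scheduled, so the sketch below is only one plausible reading. It is a host-side simulation in Python: plain lists stand in for per-GPU buffers, and `PAIR_STAGES` and `pairwise_allreduce` are hypothetical names, not identifiers from ik_llama.cpp or NCCL.

```python
# Host-side simulation only: Python lists stand in for per-GPU buffers.
# Assumption: the four pairwise groups are used as two stages of a
# butterfly all-reduce. The commit message does not confirm this.
PAIR_STAGES = [
    [(0, 1), (2, 3)],  # stage 1: reduce within neighbouring pairs
    [(0, 2), (1, 3)],  # stage 2: reduce across the two pairs
]


def pairwise_allreduce(buffers):
    """Return new per-device buffers after both pairwise stages.

    After stage 1, devices 0 and 1 hold the sum of buffers 0 and 1,
    and devices 2 and 3 hold the sum of buffers 2 and 3.
    After stage 2, every device holds the sum of all four buffers.
    """
    bufs = [list(b) for b in buffers]  # copy so the input is not modified
    for stage in PAIR_STAGES:
        for a, b in stage:
            summed = [x + y for x, y in zip(bufs[a], bufs[b])]
            bufs[a] = summed
            bufs[b] = list(summed)
    return bufs


# Each simulated GPU starts with its own partial result.
partials = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0], [1000.0, 2000.0]]
result = pairwise_allreduce(partials)
```

With this pairing each device takes part in exactly one exchange per stage, so four devices need two stages. The same pattern does not carry over directly to GPU counts that are not a power of two, which fits the concern above about needing bespoke logic for every GPU count.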