Surprisingly (at least to me), this is quite a bit faster than
std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now
at 105 t/s at zero context!
* Adding fused_norm - same idea as fused_rms_norm
* Avoid computing the attention reduce op for cohere2
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP: absorb adding input into std_attn and std_ffn
* WIP: NCCL infra
* WIP: add reduce and fake_cpy ops
* WIP
* WIP: graph appears to work, layer is broken
* WIP: Qwen3-MoE works with graph, layer still broken
* WIP: GLM-4.5 graph works
* WIP: fix sm layer (dense)
* WIP: fix sm layer (MoE)
* WIP: fast PP with bespoke 4-GPU NCCL
I guess, I'm not using NCCL the right way as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).
* WIP: Cohere2
* Explicitely set device
* Bespoke 3-GPU case
* WIP
* Do not repeat get_rows multiple times
* Fix 3 GPUs
* OK, let's leave it in
* Implement the reduce op without NCCL available
* Be able to build without NCCL
cmake -DGGML_NCCL=OFF disables it
* Make --max-gpu work again
* Slightly better for 4 GPUs without NCCL
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
I guess, I'm not using NCCL the right way as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).
Use -smgs or --split-mode-graph-scheduling in CLI to bypass the disabling of split mode graph scheduling when tensor overrides is used.
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
* This should do the trick for PP
* Command line option to set max. extra VRAM that the scheduler can use
* Fix bug and cleanup
* Looks like with this change it is working with tensor overrides
* Nah, it is not working
* OK, this seems to be working
* Disable split scheduling with tensor overrides
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This works and TG is descent, but PP is low
* Better
* Apply f_logit_scale before mul mat with output tensor
* This is better for PP: 600 t/s -> 700 t/s
* To not lose this again
* WIP
* Equal split
* WIP
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>