Commit Graph

508 Commits

Author SHA1 Message Date
Iwan Kawrakow
29d323117c Command line option to turn on async. Set to false by default for now 2025-12-27 06:24:01 +00:00
Iwan Kawrakow
07759f172c Be more careful with having set the device before using a stream 2025-12-26 18:17:16 +00:00
Iwan Kawrakow
b79bf6c0ef Merge remote-tracking branch 'origin/main' into ik/nccl3_async 2025-12-26 16:36:25 +00:00
Kawrakow
59d0022991 Graph parallel: better PP performance for 3 and more GPUs (#1092)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-26 17:35:27 +01:00
Iwan Kawrakow
443445579f Set omp max active levels 2025-12-26 05:09:27 +00:00
Iwan Kawrakow
072cd216f4 Do not use OpenMP if there are tensor overrides 2025-12-25 17:06:46 +00:00
Iwan Kawrakow
197de25020 Use OpenMP if available
Surprisingly (at least to me), this is quite a bit faster than
std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now
at 105 t/s at zero context!
2025-12-25 15:20:37 +00:00
Iwan Kawrakow
4707b09137 Merge remote-tracking branch 'origin/main' into ik/nccl3_async 2025-12-25 07:57:23 +00:00
Kawrakow
03ed5f7096 Fix split mode graph when p2p is not enabled (#1091)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-25 08:55:08 +01:00
Kawrakow
41a8d05420 Reduce add improvements without NCCL (#1088)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-25 08:44:24 +01:00
Iwan Kawrakow
6803cad2f3 Scheduler changes 2025-12-25 07:18:51 +00:00
Iwan Kawrakow
930c9f7006 Only do async for 4 or more backends
With 2 GPUs (so, 3 backends), not using async is slightly faster
2025-12-24 16:15:50 +00:00
Iwan Kawrakow
16d0dd794c Merge remote-tracking branch 'origin/main' into ik/nccl3_async 2025-12-24 15:28:13 +00:00
Kawrakow
fbb67fa2bd Fused norm (#1086)
* Adding fused_norm - same idea as fused_rms_norm

* Avoid computing the attention reduce op for cohere2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 15:22:43 +01:00
Kawrakow
1d7d0225a0 Graph parallel: the next generation (#1080)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).

* WIP: Cohere2

* Explicitly set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Implement the reduce op without NCCL available

* Be able to build without NCCL

cmake -DGGML_NCCL=OFF disables it

* Make --max-gpu work again

* Slightly better for 4 GPUs without NCCL

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 08:31:48 +01:00
Iwan Kawrakow
ef30dd8834 This sync seems enough 2025-12-23 05:17:22 +00:00
Iwan Kawrakow
dc28cadb65 Simple async 2025-12-22 18:43:13 +00:00
Iwan Kawrakow
d4c23f1f89 OK, let's leave it in 2025-12-22 17:13:23 +00:00
Iwan Kawrakow
526ce7e050 Fix 3 GPUs 2025-12-22 16:43:42 +00:00
Iwan Kawrakow
1dd9bf7bcb Do not repeat get_rows multiple times 2025-12-22 13:57:57 +00:00
Iwan Kawrakow
f7cd271cad WIP 2025-12-22 13:36:58 +00:00
Iwan Kawrakow
12c8d3c650 Bespoke 3-GPU case 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
0af67af9a5 Explicitly set device 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
aa3f14b963 WIP: Cohere2 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
d50ef0165e WIP: fast PP with bespoke 4-GPU NCCL
I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
2b44a0d946 WIP: graph appears to work, layer is broken 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
72fed6daaa WIP 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
5e86e81a2d WIP: add reduce and fake_cpy ops 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
655f6ce301 WIP: NCCL infra 2025-12-22 11:16:24 +00:00
Kawrakow
21fc9322f9 cuda: set device to src device before p2p copy (#1073)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 12:50:34 +01:00
Kawrakow
51eea5715f Better PP performance with split mode "graph" and 3+ GPUs (#1069)
* This should do the trick for PP

* Command line option to set max. extra VRAM that the scheduler can use

* Fix bug and cleanup

* Looks like with this change it is working with tensor overrides

* Nah, it is not working

* OK, this seems to be working

* Disable split scheduling with tensor overrides

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 07:40:25 +01:00
Kawrakow
8ccceff4e9 Much better TG speed with split mode "graph" (#1067)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-16 19:48:20 +01:00
firecoperana
090f354d33 Refactor chat and server file (#1062)
* Add alternative log functions

* chat: fix int overflow, prevent size calculation in float/double (#17357)

* chat: fix int overflow, prevent size calculation in float/double

* Update common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* common : move all common_chat_parse_* to chat-parser.cpp. (#17481)

# Conflicts:
#	common/chat.cpp

* server: split server.cpp code into server/common/task/queue/context

* Fix compiler warning

* Clean up code

* common: use native MultiByteToWideChar

* move server prompt to server task

* Clean code

* delete utils.hpp

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
2025-12-15 08:27:20 +01:00
Kawrakow
f90d1fdd06 Split mode "graph" for Cohere2 (#1061)
* This works and TG is decent, but PP is low

* Better

* Apply f_logit_scale before mul mat with output tensor

* This is better for PP: 600 t/s -> 700 t/s

* To not lose this again

* WIP

* Equal split

* WIP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 20:30:08 +01:00
Kawrakow
5645be6cfc Fix sync logic (#1064)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 18:40:49 +01:00
Kawrakow
f667bd58b0 Undo sync reduction (#1063)
I'm finding issues for Qwen3-MoE

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 16:58:32 +01:00
Kawrakow
cc14d4a3cc Fix overflow in offset calculation in mmq (#1059)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 14:31:06 +01:00
Kawrakow
b74fb479af Be able to enable or disable P2P via command line argument (#1058)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 13:36:42 +01:00
Kawrakow
0698501ae2 Slightly faster TG for split mode "graph" (#1057)
* Rearrange graph nodes

So that we can do graph portions that are the same on 2 or more
GPUs at the same time.

* Separate graph compute implementation for split mode graph

* This is better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 07:54:37 +01:00
abc-nix
0feb046e6b enable peer access (NVlink) (#1050)
* enable peer access for cuda

* Remove redundant loop
2025-12-11 08:31:56 +01:00
Kawrakow
00d939c811 Reduce back-end syncs (#1049)
* Reduce backend synchronization calls

* Also this

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:04:44 +01:00
Djip007
808ce4907c Unroll for loop for repacked BF16 MATMUL (#1047)
see https://github.com/ikawrakow/ik_llama.cpp/discussions/1028 for
details
2025-12-08 06:09:45 +01:00
Kawrakow
c9fcfb9a7a Fix annoying compiler warnings (#1042)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-06 09:59:07 +01:00
Kawrakow
87f6943e4b Automatically disable CUDA graphs for split mode "graph" (#1040)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-06 07:38:02 +01:00
Kawrakow
a3737f4296 CUDA: set current device in compute_forward (#1039)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 16:47:50 +01:00
firecoperana
e741ec8a5d CUDA: Fix FA for Pascal GPU (#1036)
Co-authored-by: firecoperana <firecoperana>
2025-12-05 16:42:14 +01:00
Kawrakow
b43801a2d2 Fix debug build (#1037)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 14:06:22 +01:00
Kawrakow
b715342e82 K-cache Hadamard transforms (CUDA) (#1034)
* Hadamard transforms for K-cache on CUDA

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-04 18:46:22 +01:00
Kawrakow
658ced0abd Hadamard transforms for K-cache - CPU only (#1033)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-04 06:51:11 +01:00
Kawrakow
74c56067b4 Fix bug in ggml_cuda_op_scale_tensor (#1031)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-03 11:32:19 +01:00