Commit Graph

507 Commits

Author SHA1 Message Date
Kawrakow
8f98961b96 Fix build failure when OpenMP is not available 2026-01-20 13:06:25 +02:00
Kawrakow
132a01d25d GLM-4.7-Flash support (#1168)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0
2026-01-20 12:46:52 +02:00
Kawrakow
98b30e5e81 Faster adaptive_p sampling (#1165)
* A hopefully more efficient adaptive_p sampling

* Once at it, lets fix the formatting too

* More formatting

* Hopefully better

* This should be better

* Correctly accumulate adaptive_p sampling time

* AVX2
2026-01-19 16:03:09 +02:00
Kawrakow
6a5c180be9 Fix bf16 additions on CUDA arch < Ampere (#1164)
* Fix bf16 additions on CUDA arch < Ampere

* Prevent using NCCL if graph reduce type is bf16 and arch < AMPERE
2026-01-19 12:27:52 +02:00
Kawrakow
0c0b6e4b8b Copy reduce result to other GPUs if necessary (#1156) 2026-01-19 08:40:26 +02:00
firecoperana
d71a3ec315 Server: refactor and rename functions (#1151)
* Server: rename functions and refactor code

rename functions

refactor update slots

rename params_base

rename timings

* change

* Revert kv cache name changes

* Revert 2

* fix test build error

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
Kawrakow
7024fdbc72 Additional graph reduce types for split mode graph (#1154)
* WIP: add Q8_0 and BF16 as possible reduce types

Does not work - there is a big somewhere

* This finally works
2026-01-18 08:02:49 +02:00
Kawrakow
709e1a5375 Fixing split mode graph with many GPUs (#1152)
* Attempt to fix the many GPU issue in split mode graph

* WIP: this seems more stable

Still hanging after a while if I try to use all 7 GPUs

* Reenable OpenMP in scheduler async

Seems solid up to 4 GPUs. It did hang with --max-gpu 6.

* printf cleanup
2026-01-17 08:05:24 +02:00
Kawrakow
c03c2d7cc6 Merge ffn_up and ffn_gate experts tensors (#1137)
* WIP - not working

* WIP - not working

* WIP - GPT-OSS working

However, extremely stupid. The only way I could correctly repack the
up/gate experts is to copy up and gate into host buffers, repack
into another host buffer, copy back into the ffn_up_gate_exps tensor.
This is going to be very slow for giant 500 GB models.

My attempts to do this via a compute graph on the backend holding
the tensors was unsuccessful.

For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.

Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.

* WIP

* WIP - Qwen3-MoE (and hopefully all others) working

But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.

* WIP: TG seems to be working

* Minor

* Add command line option to merge experts up/gate

* Add merge up/gate command line parameter to llama-bench

* Turn off merge_up_gate_exps if split mode graph

It is not yet implemented

* When no bias, allow merging up/gate with tensor overrides

* Arghh, we need to increase the context size again

* Cleanup
2026-01-12 18:30:53 +02:00
Kawrakow
c7348f6f55 Fix mla = 0 (#1130)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-10 10:34:30 +02:00
firecoperana
c03ee1a4d2 server: improve speed of speculative decoding (#1119)
* server: improve speed of speculative decoding

change logs

rpc: add recompute

spec dec fix

* Fix n_batch_size not set to context size for draft model

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-10 08:01:22 +02:00
Kawrakow
8725d110d2 Fix data races in the reduce op (#1124)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-09 10:34:58 +02:00
Kawrakow
0456aa47d3 Do not abort on NCCL initizalization failure (#1120)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-08 09:19:50 +02:00
firecoperana
9c1bef35e8 CUDA: compress-mode size (#1110)
Co-authored-by: firecoperana <firecoperana>
2026-01-07 18:33:17 +02:00
Kawrakow
a82dcbf3ee Fix ring reduction (#1114)
* Fix ring reduction

* Actually enable it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-07 08:01:31 +02:00
Kawrakow
54a513768c Disable ring reduction for now (#1112)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-06 15:40:50 +02:00
Kawrakow
419a397ce0 Graph parallel for Mimo-V2-Flash (#1105)
* WIP

* Cleanup

* Set max_gpu to 2 for Mimo2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 09:58:54 +02:00
Kawrakow
385fc14110 Fix race in CUDA FA for head sizes 192/128 (#1104)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 08:21:07 +02:00
Kawrakow
ab50c6cdcb Mimo-V2-Flash support (#1096)
* Mimo-2 support

* Fix bug for head sizes not being the same

It still does not solve the Mimo-2 quantized cache issue.

* Fix quantized cache

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 08:00:01 +02:00
firecoperana
56dceefd6b Fix windows build with CUDA (#1101)
Co-authored-by: firecoperana <firecoperana>
2026-01-05 07:59:23 +02:00
Kawrakow
17a5a80946 Fix Windows build (#1097) 2025-12-29 14:18:27 +01:00
Kawrakow
519405dc97 Async compute graph evaluation (2 or more GPUs) (#1089)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess, I'm not using NCCL the right way as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).

* WIP: Cohere2

* Explicitely set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Simple async

* This sync seems enough

* Only do async for 4 or more backends

With 2 GPUs (so, 3 backends) not using async is slightly faster

* Scheduler changes

* Use OpenMP if available

Surprisingly (at least to me), this is quite a bit faster than
std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now
at 105 t/s at zero context!

* Do not use OpenMP if there are tensor overrides

* Set omp max active levels

* Be more careful with having set the device before using a stream

* Command line option to turn on async. Set to false by defualt for now

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-27 08:18:06 +01:00
Kawrakow
7146de451d Be more careful with having set the device before using a stream (#1093)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-26 19:19:41 +01:00
Kawrakow
8687fca3ff Graph parallel: better PP performance for 3 and more GPUs (#1092)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-26 17:35:27 +01:00
Kawrakow
a2ffceb235 Fix split mode graph when p2p is not enabled (#1091)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-25 08:55:08 +01:00
Kawrakow
3be3649db9 Reduce add improvemens without NCCL (#1088)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-25 08:44:24 +01:00
Kawrakow
ada5cc1523 Fused norm (#1086)
* Adding fused_norm - same idea as fused_rms_norm

* Avoid computing the attention reduce op for cohere2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 15:22:43 +01:00
Kawrakow
0d7eb34185 Graph parallel: the next generation (#1080)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess, I'm not using NCCL the right way as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).

* WIP: Cohere2

* Explicitely set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Implement the reduce op without NCCL available

* Be able to build without NCCL

cmake -DGGML_NCCL=OFF disables it

* Make --max-gpu work again

* Slightly better for 4 GPUs without NCCL

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 08:31:48 +01:00
Kawrakow
ecabd6acf7 cuda: set device to src device before p2p copy (#1073)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 12:50:34 +01:00
Kawrakow
5585ac2aa8 Better PP performance with split mode "graph" and 3+ GPUs (#1069)
* This should do the trick for PP

* Command line option to set max. extra VRAM that the scheduler can use

* Fix bug and cleanup

* Looks like with this change it is working with tensor overrides

* Nah, it is not working

* OK, this seems to be working

* Disable split scheduling with tensor overrides

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 07:40:25 +01:00
Kawrakow
75de0528c3 Much better TG speed with split mode "graph" (#1067)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-16 19:48:20 +01:00
firecoperana
0e91b89cd3 Refactor chat and server file (#1062)
* Add alternative log functions

* chat: fix int overflow, prevent size calculation in float/double (#17357)

* chat: fix int overflow, prevent size calculation in float/double

* Update common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* common : move all common_chat_parse_* to chat-parser.cpp. (#17481)

# Conflicts:
#	common/chat.cpp

* server: split server.cpp code into server/common/task/queue/context

* Fix compiler warning

* Clean up code

* common: use native MultiByteToWideChar

* move server prompt to server task

* Clean code

* delete utils.hpp

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
2025-12-15 08:27:20 +01:00
Kawrakow
d97a6de34d Split mode "graph" for Cohere2 (#1061)
* This works and TG is descent, but PP is low

* Better

* Apply f_logit_scale before mul mat with output tensor

* This is better for PP: 600 t/s -> 700 t/s

* To not lose this again

* WIP

* Equal split

* WIP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 20:30:08 +01:00
Kawrakow
844a8b0bfa Fix sync logic (#1064)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 18:40:49 +01:00
Kawrakow
2e04b7cbef Undo sync reduction (#1063)
I'm finding issues for Qwen3-MoE

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 16:58:32 +01:00
Kawrakow
b3a19a6f37 Fix overflow in offset calculation in mmq (#1059)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 14:31:06 +01:00
Kawrakow
53fb7a4118 Be able to enable or disable P2P via command line argument (#1058)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 13:36:42 +01:00
Kawrakow
f65fefa36c Slightly faster TG for split mode "graph" (#1057)
* Rearrange graph nodes

So that we can do graph portions that are the same on 2 or more
GPUs at the same time.

* Separate graph compute implementation for split mode graph

* This is better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 07:54:37 +01:00
abc-nix
37e41d22dc enable peer access (NVlink) (#1050)
* enable peer access for cuda

* Remove redundant loop
2025-12-11 08:31:56 +01:00
Kawrakow
02206cff46 Reduce back-end syncs (#1049)
* Reduce backend synchronization calls

* Also this

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:04:44 +01:00
Djip007
5669d39036 Unroll for loop for repacked BF16 MATMUL (#1047)
see https://github.com/ikawrakow/ik_llama.cpp/discussions/1028 for
detail
2025-12-08 06:09:45 +01:00
Kawrakow
2f645f2579 Fix annoying compiler warnings (#1042)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-06 09:59:07 +01:00
Kawrakow
e02b71f89e Automatically disable CUDA graphs for split mode "graph" (#1040)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-06 07:38:02 +01:00
Kawrakow
0383dfb177 CUDA: set current device in compute_forward (#1039)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 16:47:50 +01:00
firecoperana
42e4c61243 CUDA: Fix FA for Pascal GPU (#1036)
Co-authored-by: firecoperana <firecoperana>
2025-12-05 16:42:14 +01:00
Kawrakow
9264abfbaf Fix debug build (#1037)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 14:06:22 +01:00
Kawrakow
efc8c8ef8d K-cache Hadamard transforms (CUDA) (#1034)
* Hadamard transforms for K-cache on CUDA

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-04 18:46:22 +01:00
Kawrakow
18fdd80eaf Hadamard transforms for K-cache - CPU only (#1033)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-04 06:51:11 +01:00
Kawrakow
7fbe8d3ac2 Fix bug in ggml_cuda_op_scale_tensor (#1031)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-03 11:32:19 +01:00
Kawrakow
a719349982 POC: CUDA tensor parallel (MoE models) (#1022)
* Remove most of split mode row

* WIP

* WIP: also allocate the KV cache using tensor split

* WIP: it runs with wrong result

But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its
     entire attn calculation
* Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
  wait for GPU 0 to finish its entore FFN calculation before it can
  start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduling

* WIP

* This works, but it is slow

* This is slightly better

the graph is still not being computed in parallel.
Why? Because the scheduler creates graph splits where the
result of the computation on one GPU becomes an input for the
other split. Hence, to trigger the computation on the second GPU
one needs to wait for the computation on the first GPU to finish,
even thiough the two can be done in parallel up to the sunchronization
point. So, all that is left to do is to trick the scheduler to create
to splits that can be done in parallel, and then have a graph split
where the results get combined.

* Playing games with the scheduler

This change tricks it into doing the right thing^TM.
Still quite a bit slower than split mode layer for the 8B LlaMA model.
But for the 70B LlaMA it now beats split mode layer for TG:
28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s.
In comparison, split mode "row" in mainline gets
484 t/s PP and 19.3 t/s TG.

* Fix attn split

Granularity for Wq, Wo is not just head size, but
head size * gqa_ratio.
Else the Wk, Wv tensors end up not being a multiple of the
head size when we divide the split determined by Wo with
the gqa_ratio.

* Show memory used per device

* Make it work with partial offload

but no tensor overrides yet, just ngl < num_layers.

* Allow for f16 source in fused_rms_norm

* This results in faster PP.

Now PP is faster than split mode layer for L3-70B.

* Rename split mode "row" to split mode "graph"

* Leave FFN partial results as f16

* WIP GLM4.5 - runs with wrong results

* WIP GLM4.5 - this works

PP is already better than split mode layer, but TG for zero context
is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer
at around 20k tokens. PP at 26k tokens is 1.55X of sm layer.

* Work around compiler bug

It issues a warning that there is an extra semicolon outside of a function,
but there isn't. If I remove the anonymous namespace and turn the
functions inside into static, the warning disapears, so clearly
a compiler bug.

* Make graph reuse work with split mode graph

* Remove more split mode row remnants

* WIP tensor overrides

Runs with wrong results, don't see where the issue could be.

* This works but is slow

Still does not work for row-interleaved quants

* Slightly better

* Slightly better

* Row-interleaved quants work

* Better

* Minor

* Guarad against using split mode "graph" for unsupported models

* Guards against using merge_qkv with split mode "graph"

* WIP split mode attn

Works for LlaMA models, but not for GLM-4.5.
Doesn't seem to improve performance, so I guess no point in trying to
fix it.

* Split mode graph for qwen3moe

* Try to better distribute the splits

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-01 19:25:40 +01:00