Commit Graph

515 Commits

Author SHA1 Message Date
Kawrakow
f0fb76da64 Better GLM-4.7-Flash long context TG performance (#1182)
* Better GLM-4.7-Flash long context TG performance

* Handle quantized cache
2026-01-24 07:05:48 +02:00
Kawrakow
2a7cc09149 Remove llamafile remnants (#1179) 2026-01-22 13:20:23 +02:00
Kawrakow
66caa42b53 Fix build with GGML_CUDA_GRAPHS=OFF 2026-01-22 10:46:57 +00:00
Kawrakow
851fda3509 Split mode graph: use CUDA graphs (#1177)
* Use CUDA graphs also when there are tensor overrides

* Change graph key

* This seems to work
2026-01-22 12:38:36 +02:00
Kawrakow
101fe54797 CUDA graphs with tensor overrides (#1172)
* Use CUDA graphs also when there are tensor overrides

* Change graph key
2026-01-22 12:28:11 +02:00
Kawrakow
1cb8cd534f Fix build failure when OpenMP is not available (#1171) 2026-01-22 12:26:23 +02:00
Kawrakow
77c18acc90 Fix non-contiguous batched cuBLAS (#1178) 2026-01-22 12:25:05 +02:00
Kawrakow
6f1a69352f Fuse experts bias in top_k_moe kernel (#1170)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0

* Fuse bias in top_k_moe kernel if present
2026-01-20 15:38:51 +02:00
Kawrakow
996e77047a Avoid ggml_get_rows if not necessary (#1160)
* Copy reduce result to other GPUs if necessary

* Avoid ggml_get_rows for TG

* For the output ops use the result of the split that ran on the main GPU

* More models
2026-01-20 15:38:21 +02:00
Kawrakow
132a01d25d GLM-4.7-Flash support (#1168)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0
2026-01-20 12:46:52 +02:00
Kawrakow
98b30e5e81 Faster adaptive_p sampling (#1165)
* A hopefully more efficient adaptive_p sampling

* While at it, let's fix the formatting too

* More formatting

* Hopefully better

* This should be better

* Correctly accumulate adaptive_p sampling time

* AVX2
2026-01-19 16:03:09 +02:00
Kawrakow
6a5c180be9 Fix bf16 additions on CUDA arch < Ampere (#1164)
* Fix bf16 additions on CUDA arch < Ampere

* Prevent using NCCL if graph reduce type is bf16 and arch < AMPERE
2026-01-19 12:27:52 +02:00
Kawrakow
0c0b6e4b8b Copy reduce result to other GPUs if necessary (#1156) 2026-01-19 08:40:26 +02:00
firecoperana
d71a3ec315 Server: refactor and rename functions (#1151)
* Server: rename functions and refactor code

rename functions

refactor update slots

rename params_base

rename timings

* change

* Revert kv cache name changes

* Revert 2

* fix test build error

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
Kawrakow
7024fdbc72 Additional graph reduce types for split mode graph (#1154)
* WIP: add Q8_0 and BF16 as possible reduce types

Does not work - there is a bug somewhere

* This finally works
2026-01-18 08:02:49 +02:00
Kawrakow
709e1a5375 Fixing split mode graph with many GPUs (#1152)
* Attempt to fix the many GPU issue in split mode graph

* WIP: this seems more stable

Still hanging after a while if I try to use all 7 GPUs

* Reenable OpenMP in scheduler async

Seems solid up to 4 GPUs. It did hang with --max-gpu 6.

* printf cleanup
2026-01-17 08:05:24 +02:00
Kawrakow
c03c2d7cc6 Merge ffn_up and ffn_gate experts tensors (#1137)
* WIP - not working

* WIP - not working

* WIP - GPT-OSS working

However, extremely stupid. The only way I could correctly repack the
up/gate experts is to copy up and gate into host buffers, repack
into another host buffer, copy back into the ffn_up_gate_exps tensor.
This is going to be very slow for giant 500 GB models.

My attempts to do this via a compute graph on the backend holding
the tensors were unsuccessful.

For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.

Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.

* WIP

* WIP - Qwen3-MoE (and hopefully all others) working

But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.

* WIP: TG seems to be working

* Minor

* Add command line option to merge experts up/gate

* Add merge up/gate command line parameter to llama-bench

* Turn off merge_up_gate_exps if split mode graph

It is not yet implemented

* When no bias, allow merging up/gate with tensor overrides

* Arghh, we need to increase the context size again

* Cleanup
2026-01-12 18:30:53 +02:00
Kawrakow
c7348f6f55 Fix mla = 0 (#1130)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-10 10:34:30 +02:00
firecoperana
c03ee1a4d2 server: improve speed of speculative decoding (#1119)
* server: improve speed of speculative decoding

change logs

rpc: add recompute

spec dec fix

* Fix n_batch_size not set to context size for draft model

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-10 08:01:22 +02:00
Kawrakow
8725d110d2 Fix data races in the reduce op (#1124)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-09 10:34:58 +02:00
Kawrakow
0456aa47d3 Do not abort on NCCL initialization failure (#1120)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-08 09:19:50 +02:00
firecoperana
9c1bef35e8 CUDA: compress-mode size (#1110)
Co-authored-by: firecoperana <firecoperana>
2026-01-07 18:33:17 +02:00
Kawrakow
a82dcbf3ee Fix ring reduction (#1114)
* Fix ring reduction

* Actually enable it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-07 08:01:31 +02:00
Kawrakow
54a513768c Disable ring reduction for now (#1112)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-06 15:40:50 +02:00
Kawrakow
419a397ce0 Graph parallel for Mimo-V2-Flash (#1105)
* WIP

* Cleanup

* Set max_gpu to 2 for Mimo2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 09:58:54 +02:00
Kawrakow
385fc14110 Fix race in CUDA FA for head sizes 192/128 (#1104)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 08:21:07 +02:00
Kawrakow
ab50c6cdcb Mimo-V2-Flash support (#1096)
* Mimo-2 support

* Fix bug for head sizes not being the same

It still does not solve the Mimo-2 quantized cache issue.

* Fix quantized cache

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 08:00:01 +02:00
firecoperana
56dceefd6b Fix windows build with CUDA (#1101)
Co-authored-by: firecoperana <firecoperana>
2026-01-05 07:59:23 +02:00
Kawrakow
17a5a80946 Fix Windows build (#1097) 2025-12-29 14:18:27 +01:00
Kawrakow
519405dc97 Async compute graph evaluation (2 or more GPUs) (#1089)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).

* WIP: Cohere2

* Explicitly set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Simple async

* This sync seems enough

* Only do async for 4 or more backends

With 2 GPUs (so, 3 backends) not using async is slightly faster

* Scheduler changes

* Use OpenMP if available

Surprisingly (at least to me), this is quite a bit faster than
std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now
at 105 t/s at zero context!

* Do not use OpenMP if there are tensor overrides

* Set omp max active levels

* Be more careful with having set the device before using a stream

* Command line option to turn on async. Set to false by default for now

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-27 08:18:06 +01:00
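Note on the "bespoke 4-GPU NCCL" item above: the message lists four pairwise communicator groups, (0,1), (2,3), (0,2), (1,3), but does not say how they are combined. One plausible reading is a two-stage butterfly all-reduce, where each GPU first sums with one partner and then with a second. The following is a minimal pure-Python simulation of that idea, not the code in this commit; `pairwise_allreduce` and `STAGES_4GPU` are hypothetical names, and no NCCL calls are involved.

```python
# Simulation of a two-stage pairwise (butterfly) all-reduce on 4 "GPUs".
# Assumption: the pairs from the commit message are grouped into two stages;
# the commit itself does not confirm this ordering.

def pairwise_allreduce(buffers, stages):
    """Sum-reduce one buffer per GPU using only pairwise exchanges.

    buffers: list of equal-length lists of numbers, one per GPU.
    stages:  list of stages; each stage is a list of (a, b) GPU index pairs
             that exchange and sum their buffers "simultaneously".
    Returns a new list of buffers; the input is not modified.
    """
    bufs = [list(b) for b in buffers]
    for stage in stages:
        for a, b in stage:
            total = [x + y for x, y in zip(bufs[a], bufs[b])]
            bufs[a] = total
            bufs[b] = list(total)  # both partners end up with the pair's sum
    return bufs

# Stage 1 reduces within (0,1) and (2,3); stage 2 reduces across (0,2) and (1,3).
STAGES_4GPU = [[(0, 1), (2, 3)], [(0, 2), (1, 3)]]

if __name__ == "__main__":
    partial = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0], [1000.0, 2000.0]]
    for rank, buf in enumerate(pairwise_allreduce(partial, STAGES_4GPU)):
        print(rank, buf)
```

After both stages every simulated GPU holds the full sum, and each GPU has taken part in only two pairwise exchanges. As the message itself notes, this scheme needs a hand-written pair list for every GPU count, which is why it is described there as not being a general solution.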
Kawrakow
7146de451d Be more careful with having set the device before using a stream (#1093)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-26 19:19:41 +01:00
Kawrakow
8687fca3ff Graph parallel: better PP performance for 3 and more GPUs (#1092)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-26 17:35:27 +01:00
Kawrakow
a2ffceb235 Fix split mode graph when p2p is not enabled (#1091)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-25 08:55:08 +01:00
Kawrakow
3be3649db9 Reduce add improvements without NCCL (#1088)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-25 08:44:24 +01:00
Kawrakow
ada5cc1523 Fused norm (#1086)
* Adding fused_norm - same idea as fused_rms_norm

* Avoid computing the attention reduce op for cohere2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 15:22:43 +01:00
Kawrakow
0d7eb34185 Graph parallel: the next generation (#1080)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).

* WIP: Cohere2

* Explicitly set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Implement the reduce op without NCCL available

* Be able to build without NCCL

cmake -DGGML_NCCL=OFF disables it

* Make --max-gpu work again

* Slightly better for 4 GPUs without NCCL

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 08:31:48 +01:00
Kawrakow
ecabd6acf7 cuda: set device to src device before p2p copy (#1073)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 12:50:34 +01:00
Kawrakow
5585ac2aa8 Better PP performance with split mode "graph" and 3+ GPUs (#1069)
* This should do the trick for PP

* Command line option to set max. extra VRAM that the scheduler can use

* Fix bug and cleanup

* Looks like with this change it is working with tensor overrides

* Nah, it is not working

* OK, this seems to be working

* Disable split scheduling with tensor overrides

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 07:40:25 +01:00
Kawrakow
75de0528c3 Much better TG speed with split mode "graph" (#1067)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-16 19:48:20 +01:00
firecoperana
0e91b89cd3 Refactor chat and server file (#1062)
* Add alternative log functions

* chat: fix int overflow, prevent size calculation in float/double (#17357)

* chat: fix int overflow, prevent size calculation in float/double

* Update common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* common : move all common_chat_parse_* to chat-parser.cpp. (#17481)

# Conflicts:
#	common/chat.cpp

* server: split server.cpp code into server/common/task/queue/context

* Fix compiler warning

* Clean up code

* common: use native MultiByteToWideChar

* move server prompt to server task

* Clean code

* delete utils.hpp

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
2025-12-15 08:27:20 +01:00
Kawrakow
d97a6de34d Split mode "graph" for Cohere2 (#1061)
* This works and TG is decent, but PP is low

* Better

* Apply f_logit_scale before mul mat with output tensor

* This is better for PP: 600 t/s -> 700 t/s

* To not lose this again

* WIP

* Equal split

* WIP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 20:30:08 +01:00
Kawrakow
844a8b0bfa Fix sync logic (#1064)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 18:40:49 +01:00
Kawrakow
2e04b7cbef Undo sync reduction (#1063)
I'm finding issues for Qwen3-MoE

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 16:58:32 +01:00
Kawrakow
b3a19a6f37 Fix overflow in offset calculation in mmq (#1059)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 14:31:06 +01:00
Kawrakow
53fb7a4118 Be able to enable or disable P2P via command line argument (#1058)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 13:36:42 +01:00
Kawrakow
f65fefa36c Slightly faster TG for split mode "graph" (#1057)
* Rearrange graph nodes

So that we can do graph portions that are the same on 2 or more
GPUs at the same time.

* Separate graph compute implementation for split mode graph

* This is better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 07:54:37 +01:00
abc-nix
37e41d22dc enable peer access (NVlink) (#1050)
* enable peer access for cuda

* Remove redundant loop
2025-12-11 08:31:56 +01:00
Kawrakow
02206cff46 Reduce back-end syncs (#1049)
* Reduce backend synchronization calls

* Also this

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:04:44 +01:00
Djip007
5669d39036 Unroll for loop for repacked BF16 MATMUL (#1047)
See https://github.com/ikawrakow/ik_llama.cpp/discussions/1028 for
details.
2025-12-08 06:09:45 +01:00
Kawrakow
2f645f2579 Fix annoying compiler warnings (#1042)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-06 09:59:07 +01:00