Commit Graph

4109 Commits

Author SHA1 Message Date
Iwan Kawrakow
29d323117c Command line option to turn on async. Set to false by default for now 2025-12-27 06:24:01 +00:00
Iwan Kawrakow
07759f172c Be more careful to set the device before using a stream 2025-12-26 18:17:16 +00:00
Iwan Kawrakow
b79bf6c0ef Merge remote-tracking branch 'origin/main' into ik/nccl3_async 2025-12-26 16:36:25 +00:00
Kawrakow
59d0022991 Graph parallel: better PP performance for 3 or more GPUs (#1092)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-26 17:35:27 +01:00
Iwan Kawrakow
443445579f Set omp max active levels 2025-12-26 05:09:27 +00:00
Iwan Kawrakow
072cd216f4 Do not use OpenMP if there are tensor overrides 2025-12-25 17:06:46 +00:00
Iwan Kawrakow
197de25020 Use OpenMP if available
Surprisingly (at least to me), this is quite a bit faster than
std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now
at 105 t/s at zero context!
2025-12-25 15:20:37 +00:00
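The switch described in the commit above can be illustrated with a minimal sketch (not the actual ik_llama.cpp code; run_backend_graph and n_backends are hypothetical names): one OpenMP thread per backend replaces a std::thread pool synchronized with std::barrier, and the implicit barrier at the end of the parallel region takes over the synchronization.

```cpp
// Minimal sketch, assuming a hypothetical run_backend_graph() per backend.
// Not the actual ik_llama.cpp code.
#include <omp.h>

static void run_backend_graph(int backend) {
    // hypothetical per-backend graph evaluation
    (void) backend;
}

static void run_all_backends(int n_backends) {
    // Allow nested parallel regions in case a backend uses OpenMP itself
    // (compare the "Set omp max active levels" commit above).
    omp_set_max_active_levels(2);

    #pragma omp parallel num_threads(n_backends)
    {
        run_backend_graph(omp_get_thread_num());
        // The implicit barrier at the end of the parallel region replaces an
        // explicit std::barrier::arrive_and_wait() in a std::thread version.
    }
}
```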
Iwan Kawrakow
4707b09137 Merge remote-tracking branch 'origin/main' into ik/nccl3_async 2025-12-25 07:57:23 +00:00
Kawrakow
03ed5f7096 Fix split mode graph when p2p is not enabled (#1091)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-25 08:55:08 +01:00
Kawrakow
41a8d05420 Reduce add improvements without NCCL (#1088)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-25 08:44:24 +01:00
Iwan Kawrakow
6803cad2f3 Scheduler changes 2025-12-25 07:18:51 +00:00
Iwan Kawrakow
930c9f7006 Only do async for 4 or more backends
With 2 GPUs (so 3 backends), not using async is slightly faster
2025-12-24 16:15:50 +00:00
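A minimal sketch of the heuristic in the message above (hypothetical names, not the actual scheduler code):

```cpp
// Sketch only: gate asynchronous backend evaluation on the backend count.
// With 2 GPUs there are 3 backends (CPU + 2 GPUs), where sync is slightly faster.
static bool use_async_eval(int n_backends) {
    return n_backends >= 4;
}
```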
Iwan Kawrakow
16d0dd794c Merge remote-tracking branch 'origin/main' into ik/nccl3_async 2025-12-24 15:28:13 +00:00
Kawrakow
fbb67fa2bd Fused norm (#1086)
* Adding fused_norm - same idea as fused_rms_norm

* Avoid computing the attention reduce op for cohere2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 15:22:43 +01:00
Kawrakow
1ace5b7526 Be able to set reduce op data type for split mode "graph" (#1087)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 14:01:29 +01:00
firecoperana
2421a7e12b Webui: improve scroll and bug fixes (#1082)
* Webui: fix message scroll back due to setPending

smooth scroll

remove throttle

increase scroll margin

# Conflicts:
#	examples/server/public/index.html.gz
#	examples/server/webui/dist/index.html
#	examples/server/webui/src/utils/app.context.tsx

* webui: don't scroll to bottom when conversation changes or edit message

# Conflicts:
#	examples/server/public/index.html.gz
#	examples/server/webui/dist/index.html

* Webui: fix save config error

# Conflicts:
#	examples/server/public/index.html.gz
#	examples/server/webui/dist/index.html

* Webui: add api key to request model name

# Conflicts:
#	examples/server/public/index.html.gz
#	examples/server/webui/dist/index.html

* Update

* webui: fix loading dots display issue

# Conflicts:
#	examples/server/public/index.html.gz
#	examples/server/webui/dist/index.html
#	examples/server/webui/src/components/ChatMessage.tsx

* Webui: cancel scroll when user moves up

---------

Co-authored-by: firecoperana <firecoperana>
2025-12-24 12:30:26 +01:00
Kawrakow
1d7d0225a0 Graph parallel: the next generation (#1080)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1), (2,3), (0,2), (1,3) and use those, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).

* WIP: Cohere2

* Explicitly set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Implement the reduce op without NCCL available

* Be able to build without NCCL

cmake -DGGML_NCCL=OFF disables it

* Make --max-gpu work again

* Slightly better for 4 GPUs without NCCL

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 08:31:48 +01:00
Iwan Kawrakow
ef30dd8834 This sync seems enough 2025-12-23 05:17:22 +00:00
Iwan Kawrakow
dc28cadb65 Simple async 2025-12-22 18:43:13 +00:00
Iwan Kawrakow
d4c23f1f89 OK, let's leave it in 2025-12-22 17:13:23 +00:00
Iwan Kawrakow
526ce7e050 Fix 3 GPUs 2025-12-22 16:43:42 +00:00
Iwan Kawrakow
1dd9bf7bcb Do not repeat get_rows multiple times 2025-12-22 13:57:57 +00:00
Iwan Kawrakow
f7cd271cad WIP 2025-12-22 13:36:58 +00:00
Iwan Kawrakow
12c8d3c650 Bespoke 3-GPU case 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
0af67af9a5 Explicitly set device 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
aa3f14b963 WIP: Cohere2 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
d50ef0165e WIP: fast PP with bespoke 4-GPU NCCL
I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1), (2,3), (0,2), (1,3) and use those, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).
2025-12-22 11:16:24 +00:00
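A minimal sketch of the pairwise-communicator workaround described above, assuming 4 GPUs (illustration only, not the actual ik_llama.cpp code): each of the four groups gets its own 2-device communicator via ncclCommInitAll, and a full reduction can then be done in two pairwise steps, first within (0,1) and (2,3), then across (0,2) and (1,3).

```cpp
// Sketch of creating the four pairwise NCCL communicator groups mentioned
// above. Illustration only; error handling and the reduce logic are omitted.
#include <nccl.h>

struct pair_comms {
    ncclComm_t comms[4][2];  // one communicator per device in each 2-GPU group
};

static void init_pair_comms(pair_comms & pc) {
    const int pairs[4][2] = { {0,1}, {2,3}, {0,2}, {1,3} };
    for (int g = 0; g < 4; ++g) {
        // ncclCommInitAll creates one communicator per device in the list
        ncclCommInitAll(pc.comms[g], 2, pairs[g]);
    }
}
```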
Iwan Kawrakow
297f82ed02 WIP: fix sm layer (MoE) 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
5db8262d94 WIP: fix sm layer (dense) 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
1fe53d2002 WIP: GLM-4.5 graph works 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
77bf735d10 WIP: Qwen3-MoE works with graph, layer still broken 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
2b44a0d946 WIP: graph appears to work, layer is broken 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
72fed6daaa WIP 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
5e86e81a2d WIP: add reduce and fake_cpy ops 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
655f6ce301 WIP: NCCL infra 2025-12-22 11:16:24 +00:00
Iwan Kawrakow
e2f325fad3 WIP: absorb adding input into std_attn and std_ffn 2025-12-22 11:16:24 +00:00
firecoperana
5562605076 server: exclude thinking tokens when finding the slot (#1079)
refactor find slot

enable by default

Fix load prompt

rename variables

Co-authored-by: firecoperana <firecoperana>
2025-12-22 09:46:45 +01:00
Kawrakow
21fc9322f9 cuda: set device to src device before p2p copy (#1073)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 12:50:34 +01:00
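The commit title above boils down to an ordering rule that a short sketch can illustrate (illustration only; the real backend code is more involved):

```cpp
// Sketch: make the source device current before issuing a peer-to-peer copy
// on one of its streams. Illustration only, not the actual backend code.
#include <cuda_runtime.h>

static void copy_peer(void * dst, int dst_dev, const void * src, int src_dev,
                      size_t nbytes, cudaStream_t src_stream) {
    cudaSetDevice(src_dev);  // the stream below belongs to the source device
    cudaMemcpyPeerAsync(dst, dst_dev, src, src_dev, nbytes, src_stream);
}
```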
Nexes the Elder
7bb79eff48 add split-mode-graph-scheduling parameter (#1068)
Use -smgs or --split-mode-graph-scheduling in CLI to bypass the disabling of split mode graph scheduling when tensor overrides are used.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2025-12-17 07:58:19 +01:00
Kawrakow
51eea5715f Better PP performance with split mode "graph" and 3+ GPUs (#1069)
* This should do the trick for PP

* Command line option to set max. extra VRAM that the scheduler can use

* Fix bug and cleanup

* Looks like with this change it is working with tensor overrides

* Nah, it is not working

* OK, this seems to be working

* Disable split scheduling with tensor overrides

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 07:40:25 +01:00
Kawrakow
8ccceff4e9 Much better TG speed with split mode "graph" (#1067)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-16 19:48:20 +01:00
firecoperana
756c3f8f43 Fix log issue for llama-cli (#1071)
Co-authored-by: firecoperana <firecoperana>
2025-12-16 18:12:16 +01:00
firecoperana
269cc761db Add back the fix for Kimi-K2 tool-call parsing issues (#1070)
Co-authored-by: firecoperana <firecoperana>
2025-12-16 14:44:47 +01:00
firecoperana
090f354d33 Refactor chat and server file (#1062)
* Add alternative log functions

* chat: fix int overflow, prevent size calculation in float/double (#17357)

* chat: fix int overflow, prevent size calculation in float/double

* Update common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* common : move all common_chat_parse_* to chat-parser.cpp. (#17481)

# Conflicts:
#	common/chat.cpp

* server: split server.cpp code into server/common/task/queue/context

* Fix compiler warning

* Clean up code

* common: use native MultiByteToWideChar

* move server prompt to server task

* Clean code

* delete utils.hpp

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
2025-12-15 08:27:20 +01:00
Kawrakow
0a36cea555 Use actual active number of layers when preparing splits (#1065)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-14 07:44:13 +01:00
Kawrakow
f90d1fdd06 Split mode "graph" for Cohere2 (#1061)
* This works and TG is decent, but PP is low

* Better

* Apply f_logit_scale before mul mat with output tensor

* This is better for PP: 600 t/s -> 700 t/s

* To not lose this again

* WIP

* Equal split

* WIP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 20:30:08 +01:00
Kawrakow
5645be6cfc Fix sync logic (#1064)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 18:40:49 +01:00
Kawrakow
f667bd58b0 Undo sync reduction (#1063)
I'm finding issues for Qwen3-MoE

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 16:58:32 +01:00
Kawrakow
df02c39650 Do not use split mode graph scheduling if there are tensor overrides (#1060)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 14:48:38 +01:00
Kawrakow
cc14d4a3cc Fix overflow in offset calculation in mmq (#1059)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 14:31:06 +01:00
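A minimal sketch of the class of bug named in the commit title above (hypothetical example, not the actual mmq code): when an offset is built from 32-bit ints, large matrices can make the product wrap around, so the multiplication has to be promoted to 64 bits first.

```cpp
// Hypothetical example of the overflow class fixed above, not the actual code.
#include <cstdint>

static const float * row_ptr(const float * base, int row, int row_stride) {
    // Overflows: 'row * row_stride' is evaluated as 32-bit int and can wrap.
    // return base + row * row_stride;

    // Safe: promote to 64 bits before multiplying.
    return base + (int64_t) row * row_stride;
}
```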