Iwan Kawrakow
1dd9bf7bcb
Do not repeat get_rows multiple times
2025-12-22 13:57:57 +00:00
Iwan Kawrakow
f7cd271cad
WIP
2025-12-22 13:36:58 +00:00
Iwan Kawrakow
12c8d3c650
Bespoke 3-GPU case
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
0af67af9a5
Explicitly set device
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
aa3f14b963
WIP: Cohere2
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
d50ef0165e
WIP: fast PP with bespoke 4-GPU NCCL
...
I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
((0,1), (2,3), (0,2), (1,3)) and use those, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot keep creating pairwise
communicators and the associated logic for every possible number of GPUs).
2025-12-22 11:16:24 +00:00
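The pair groups in the commit above ((0,1), (2,3), (0,2), (1,3)) are the 4-GPU instance of a recursive-doubling (hypercube) schedule, which does generalize to any power-of-two device count. A minimal sketch of generating the per-stage pair groups — illustrative only, not code from this repository, and `pairwise_groups` is a hypothetical helper:

```python
def pairwise_groups(n_gpus: int):
    """Per-stage pair groups for a recursive-doubling exchange over
    n_gpus devices (power of two). Stage d pairs device i with i XOR 2**d."""
    assert n_gpus & (n_gpus - 1) == 0, "power-of-two GPU count assumed"
    stages = []
    step = 1
    while step < n_gpus:
        # Each stage partners every device with the one 'step' apart.
        stages.append([(i, i | step) for i in range(n_gpus) if not i & step])
        step *= 2
    return stages

# For 4 GPUs this reproduces the groups from the commit message:
# stage 0: (0,1), (2,3); stage 1: (0,2), (1,3)
print(pairwise_groups(4))
```

Whether NCCL communicators built from generated groups like these would keep the fast PP path is exactly the open question raised in the commit message.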
Iwan Kawrakow
297f82ed02
WIP: fix sm layer (MoE)
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
5db8262d94
WIP: fix sm layer (dense)
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
1fe53d2002
WIP: GLM-4.5 graph works
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
77bf735d10
WIP: Qwen3-MoE works with graph, layer still broken
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
2b44a0d946
WIP: graph appears to work, layer is broken
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
72fed6daaa
WIP
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
5e86e81a2d
WIP: add reduce and fake_cpy ops
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
655f6ce301
WIP: NCCL infra
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
e2f325fad3
WIP: absorb adding input into std_attn and std_ffn
2025-12-22 11:16:24 +00:00
firecoperana
5562605076
server: exclude thinking tokens when finding the slot (#1079)
...
refactor find slot
enable by default
Fix load prompt
rename variables
Co-authored-by: firecoperana <firecoperana>
2025-12-22 09:46:45 +01:00
Kawrakow
21fc9322f9
cuda: set device to src device before p2p copy (#1073)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 12:50:34 +01:00
Nexes the Elder
7bb79eff48
add split-mode-graph-scheduling parameter (#1068)
...
Use -smgs or --split-mode-graph-scheduling in CLI to bypass the disabling of split mode graph scheduling when tensor overrides are used.
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2025-12-17 07:58:19 +01:00
Kawrakow
51eea5715f
Better PP performance with split mode "graph" and 3+ GPUs (#1069)
...
* This should do the trick for PP
* Command line option to set max. extra VRAM that the scheduler can use
* Fix bug and cleanup
* Looks like with this change it is working with tensor overrides
* Nah, it is not working
* OK, this seems to be working
* Disable split scheduling with tensor overrides
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 07:40:25 +01:00
Kawrakow
8ccceff4e9
Much better TG speed with split mode "graph" (#1067)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-16 19:48:20 +01:00
firecoperana
756c3f8f43
Fix log issue for llama-cli (#1071)
...
Co-authored-by: firecoperana <firecoperana>
2025-12-16 18:12:16 +01:00
firecoperana
269cc761db
Add back the fix for Kimi-K2 tool-call parsing issues (#1070)
...
Co-authored-by: firecoperana <firecoperana>
2025-12-16 14:44:47 +01:00
firecoperana
090f354d33
Refactor chat and server file (#1062)
...
* Add alternative log functions
* chat: fix int overflow, prevent size calculation in float/double (#17357)
* chat: fix int overflow, prevent size calculation in float/double
* Update common/chat.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* common : move all common_chat_parse_* to chat-parser.cpp. (#17481)
# Conflicts:
# common/chat.cpp
* server: split server.cpp code into server/common/task/queue/context
* Fix compiler warning
* Clean up code
* common: use native MultiByteToWideChar
* move server prompt to server task
* Clean code
* delete utils.hpp
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
2025-12-15 08:27:20 +01:00
Kawrakow
0a36cea555
Use actual active number of layers when preparing splits (#1065)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-14 07:44:13 +01:00
Kawrakow
f90d1fdd06
Split mode "graph" for Cohere2 (#1061)
...
* This works and TG is decent, but PP is low
* Better
* Apply f_logit_scale before mul mat with output tensor
* This is better for PP: 600 t/s -> 700 t/s
* To not lose this again
* WIP
* Equal split
* WIP
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 20:30:08 +01:00
Kawrakow
5645be6cfc
Fix sync logic (#1064)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 18:40:49 +01:00
Kawrakow
f667bd58b0
Undo sync reduction (#1063)
...
I'm finding issues for Qwen3-MoE
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 16:58:32 +01:00
Kawrakow
df02c39650
Do not use split mode graph scheduling if there are tensor overrides (#1060)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 14:48:38 +01:00
Kawrakow
cc14d4a3cc
Fix overflow in offset calculation in mmq (#1059)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 14:31:06 +01:00
Kawrakow
b74fb479af
Be able to enable or disable P2P via command line argument (#1058)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 13:36:42 +01:00
Kawrakow
0698501ae2
Slightly faster TG for split mode "graph" (#1057)
...
* Rearrange graph nodes
So that we can do graph portions that are the same on 2 or more GPUs at the same time.
* Separate graph compute implementation for split mode graph
* This is better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 07:54:37 +01:00
Kawrakow
6a0e72aeae
Fix #1055 (#1056)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 14:44:32 +01:00
abc-nix
0feb046e6b
enable peer access (NVLink) (#1050)
...
* enable peer access for cuda
* Remove redundant loop
2025-12-11 08:31:56 +01:00
Kawrakow
59dba9f778
Fix the fix (#1054)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 08:05:33 +01:00
Kawrakow
9484d150d8
Be able to set a max. number of GPUs to be used in split mode graph (#1051)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:22:53 +01:00
Kawrakow
6a5a707ac0
Fix llama-bench - missing buffer override comparison operator (#1053)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:21:06 +01:00
Kawrakow
00d939c811
Reduce back-end syncs (#1049)
...
* Reduce backend synchronization calls
* Also this
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:04:44 +01:00
i4TsU
62f907c663
QoL/bugfixes for llama-bench (#1052)
...
* include cuda-params and -ot in llama-bench output
* cleanup redundant type mapping
* fix wrong field name
* fix preexisting mistake in cuda_params help text (default value)
* fix preexisting mistake in kompute column header
* adjust code style to match current norms
* simplify/fix inverted columns
* fix field->value pairings/order
* remove dead field `f16_kv`
* sql printer deserves a way out too
* actually enable the new improvements....
2025-12-11 07:04:15 +01:00
Kawrakow
53f693a708
KV cache read/write for split mode "graph" (#1048)
...
* Handle split cache (write)
* Handle split cache (read)
* Fix writing the data twice
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-09 06:50:53 +01:00
Djip007
808ce4907c
Unroll for loop for repacked BF16 MATMUL (#1047)
...
See https://github.com/ikawrakow/ik_llama.cpp/discussions/1028 for details.
2025-12-08 06:09:45 +01:00
Kawrakow
c9fcfb9a7a
Fix annoying compiler warnings (#1042)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-06 09:59:07 +01:00
Kawrakow
87f6943e4b
Automatically disable CUDA graphs for split mode "graph" (#1040)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-06 07:38:02 +01:00
Kawrakow
a3737f4296
CUDA: set current device in compute_forward (#1039)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 16:47:50 +01:00
firecoperana
e741ec8a5d
CUDA: Fix FA for Pascal GPU (#1036)
...
Co-authored-by: firecoperana <firecoperana>
2025-12-05 16:42:14 +01:00
Kawrakow
f4def9b300
Don't split the output tensor (#1038)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 15:56:53 +01:00
Kawrakow
b43801a2d2
Fix debug build (#1037)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 14:06:22 +01:00
Kawrakow
b715342e82
K-cache Hadamard transforms (CUDA) (#1034)
...
* Hadamard transforms for K-cache on CUDA
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-04 18:46:22 +01:00
Kawrakow
658ced0abd
Hadamard transforms for K-cache - CPU only (#1033)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-04 06:51:11 +01:00
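The two commits above introduce Hadamard transforms for the K-cache on CUDA and CPU. As an illustrative sketch only, not the repository's implementation, an unnormalized Hadamard transform of a power-of-two-length vector can be computed in place with the fast Walsh-Hadamard recursion:

```python
def fwht(a: list) -> list:
    """In-place fast Walsh-Hadamard transform; len(a) must be a power of two.
    Applies the unnormalized Hadamard matrix H_n to the vector a."""
    n = len(a)
    assert n & (n - 1) == 0, "power-of-two length assumed"
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y  # butterfly step
        h *= 2
    return a

# H_4 applied to a unit vector spreads it into a row of +/-1 values:
print(fwht([1, 0, 0, 0]))  # -> [1, 1, 1, 1]
```

Dividing the result by sqrt(n) gives the orthonormal variant; such rotations are commonly used to spread outliers more evenly across a vector before quantizing it, which is the usual motivation for applying them to a KV cache.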
Kawrakow
08961718f3
Allow empty splits (#1029)
...
* Allow empty splits
* Fix type, add additional asserts
* Fix also output
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-03 13:52:41 +01:00
Kawrakow
bcb218102d
Use standard attention for Ministral3 (#1032)
...
This required adding the "temperature scaling" to the standard attention implementation, but in this way split mode "graph" is automatically supported.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-03 13:43:31 +01:00