ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-05-11 08:30:19 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	bf3ff8ec41	Turn on graph reuse by default	2025-12-27 07:22:46 +00:00
Kawrakow	2fe098e938	Async compute graph evaluation (2 or more GPUs) (#1089 ) * WIP: absorb adding input into std_attn and std_ffn * WIP: NCCL infra * WIP: add reduce and fake_cpy ops * WIP * WIP: graph appears to work, layer is broken * WIP: Qwen3-MoE works with graph, layer still broken * WIP: GLM-4.5 graph works * WIP: fix sm layer (dense) * WIP: fix sm layer (MoE) * WIP: fast PP with bespoke 4-GPU NCCL I guess, I'm not using NCCL the right way as PP is very low with a single communicator group for 3 or more GPUs. But if I create 4 communicator groups for pairs of GPUs (0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting 1500 t/s for L3-70B on the 4x3090 system, which is ~20% better than the previous sm graph without NCCL. But that cannot be the solution (I cannot be creating pairwise communicators and associated logic for every possible number of GPUs). * WIP: Cohere2 * Explicitly set device * Bespoke 3-GPU case * WIP * Do not repeat get_rows multiple times * Fix 3 GPUs * OK, let's leave it in * Simple async * This sync seems enough * Only do async for 4 or more backends With 2 GPUs (so, 3 backends) not using async is slightly faster * Scheduler changes * Use OpenMP if available Surprisingly (at least to me), this is quite a bit faster than std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now at 105 t/s at zero context! * Do not use OpenMP if there are tensor overrides * Set omp max active levels * Be more careful with having set the device before using a stream * Command line option to turn on async. Set to false by default for now --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-27 08:18:06 +01:00
Kawrakow	f7923739cc	Be more careful with having set the device before using a stream (#1093 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-26 19:19:41 +01:00
Kawrakow	59d0022991	Graph parallel: better PP performance for 3 and more GPUs (#1092 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-26 17:35:27 +01:00
Kawrakow	03ed5f7096	Fix split mode graph when p2p is not enabled (#1091 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-25 08:55:08 +01:00
Kawrakow	41a8d05420	Reduce add improvements without NCCL (#1088 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-25 08:44:24 +01:00
Kawrakow	fbb67fa2bd	Fused norm (#1086 ) * Adding fused_norm - same idea as fused_rms_norm * Avoid computing the attention reduce op for cohere2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-24 15:22:43 +01:00
Kawrakow	1ace5b7526	Be able to set reduce op data type for split mode "graph" (#1087 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-24 14:01:29 +01:00
firecoperana	2421a7e12b	Webui: improve scroll and bug fixes (#1082 ) * Webui: fix message scroll back due to setPending smooth scroll remove throttle increase scroll margin # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html # examples/server/webui/src/utils/app.context.tsx * webui: don't scroll to bottom when conversation changes or edit message # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html * Webui: fix save config error # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html * Webui: add api key to request model name # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html * Update * webui: fix loading dots display issue # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html # examples/server/webui/src/components/ChatMessage.tsx * Webui: cancel scroll when user moves up --------- Co-authored-by: firecoperana <firecoperana>	2025-12-24 12:30:26 +01:00
Kawrakow	1d7d0225a0	Graph parallel: the next generation (#1080 ) * WIP: absorb adding input into std_attn and std_ffn * WIP: NCCL infra * WIP: add reduce and fake_cpy ops * WIP * WIP: graph appears to work, layer is broken * WIP: Qwen3-MoE works with graph, layer still broken * WIP: GLM-4.5 graph works * WIP: fix sm layer (dense) * WIP: fix sm layer (MoE) * WIP: fast PP with bespoke 4-GPU NCCL I guess, I'm not using NCCL the right way as PP is very low with a single communicator group for 3 or more GPUs. But if I create 4 communicator groups for pairs of GPUs (0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting 1500 t/s for L3-70B on the 4x3090 system, which is ~20% better than the previous sm graph without NCCL. But that cannot be the solution (I cannot be creating pairwise communicators and associated logic for every possible number of GPUs). * WIP: Cohere2 * Explicitly set device * Bespoke 3-GPU case * WIP * Do not repeat get_rows multiple times * Fix 3 GPUs * OK, let's leave it in * Implement the reduce op without NCCL available * Be able to build without NCCL cmake -DGGML_NCCL=OFF disables it * Make --max-gpu work again * Slightly better for 4 GPUs without NCCL * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-24 08:31:48 +01:00
firecoperana	5562605076	server: exclude thinking tokens when finding the slot (#1079 ) refactor find slot enable by default Fix load prompt rename variables Co-authored-by: firecoperana <firecoperana>	2025-12-22 09:46:45 +01:00
Kawrakow	21fc9322f9	cuda: set device to src device before p2p copy (#1073 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-17 12:50:34 +01:00
Nexes the Elder	7bb79eff48	add split-mode-graph-scheduling parameter (#1068 ) Use -smgs or --split-mode-graph-scheduling in CLI to bypass the disabling of split mode graph scheduling when tensor overrides is used. Co-authored-by: Kawrakow <iwankawrakow@gmail.com>	2025-12-17 07:58:19 +01:00
Kawrakow	51eea5715f	Better PP performance with split mode "graph" and 3+ GPUs (#1069 ) * This should do the trick for PP * Command line option to set max. extra VRAM that the scheduler can use * Fix bug and cleanup * Looks like with this change it is working with tensor overrides * Nah, it is not working * OK, this seems to be working * Disable split scheduling with tensor overrides --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-17 07:40:25 +01:00
Kawrakow	8ccceff4e9	Much better TG speed with split mode "graph" (#1067 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-16 19:48:20 +01:00
firecoperana	756c3f8f43	Fix log issue for llama-cli (#1071 ) Co-authored-by: firecoperana <firecoperana>	2025-12-16 18:12:16 +01:00
firecoperana	269cc761db	Add back the fix for Kimi-K2 tool-call parsing issues (#1070 ) Co-authored-by: firecoperana <firecoperana>	2025-12-16 14:44:47 +01:00
firecoperana	090f354d33	Refactor chat and server file (#1062 ) * Add alternative log functions * chat: fix int overflow, prevent size calculation in float/double (#17357) * chat: fix int overflow, prevent size calculation in float/double * Update common/chat.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * common : move all common_chat_parse_* to chat-parser.cpp. (#17481) # Conflicts: # common/chat.cpp * server: split server.cpp code into server/common/task/queue/context * Fix compiler warning * Clean up code * common: use native MultiByteToWideChar * move server prompt to server task * Clean code * delete utils.hpp --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: DAN™ <dranger003@gmail.com>	2025-12-15 08:27:20 +01:00
Kawrakow	0a36cea555	Use actual active number of layers when preparing splits (#1065 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-14 07:44:13 +01:00
Kawrakow	f90d1fdd06	Split mode "graph" for Cohere2 (#1061 ) * This works and TG is decent, but PP is low * Better * Apply f_logit_scale before mul mat with output tensor * This is better for PP: 600 t/s -> 700 t/s * To not lose this again * WIP * Equal split * WIP --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-13 20:30:08 +01:00
Kawrakow	5645be6cfc	Fix sync logic (#1064 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-13 18:40:49 +01:00
Kawrakow	f667bd58b0	Undo sync reduction (#1063 ) I'm finding issues for Qwen3-MoE Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-13 16:58:32 +01:00
Kawrakow	df02c39650	Do not use split mode graph scheduling if there are tensor overrides (#1060 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-12 14:48:38 +01:00
Kawrakow	cc14d4a3cc	Fix overflow in offset calculation in mmq (#1059 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-12 14:31:06 +01:00
Kawrakow	b74fb479af	Be able to enable or disable P2P via command line argument (#1058 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-12 13:36:42 +01:00
Kawrakow	0698501ae2	Slightly faster TG for split mode "graph" (#1057 ) * Rearrange graph nodes So that we can do graph portions that are the same on 2 or more GPUs at the same time. * Separate graph compute implementation for split mode graph * This is better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-12 07:54:37 +01:00
Kawrakow	6a0e72aeae	Fix #1055 (#1056 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 14:44:32 +01:00
abc-nix	0feb046e6b	enable peer access (NVlink) (#1050 ) * enable peer access for cuda * Remove redundant loop	2025-12-11 08:31:56 +01:00
Kawrakow	59dba9f778	Fix the fix (#1054 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 08:05:33 +01:00
Kawrakow	9484d150d8	Be able to set a max. number of GPUs to be used in split mode graph (#1051 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 07:22:53 +01:00
Kawrakow	6a5a707ac0	Fix llama-bench - missing buffer override comparison operator (#1053 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 07:21:06 +01:00
Kawrakow	00d939c811	Reduce back-end syncs (#1049 ) * Reduce backend synchronization calls * Also this --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 07:04:44 +01:00
i4TsU	62f907c663	QoL/bugfixes for llama-bench (#1052 ) * include cuda-params and -ot in llama-bench output * cleanup redundant type mapping * fix wrong field name * fix preexisting mistake in cuda_params help text (default value) * fix preexisting mistake in kompute column header * adjust code style to match current norms * simplify/fix inverted columns * fix field->value pairings/order * remove dead field `f16_kv` * sql printer deserves a way out too * actually enable the new improvements....	2025-12-11 07:04:15 +01:00
Kawrakow	53f693a708	KV cache read/write for split mode "graph" (#1048 ) * Handle split cache (write) * Handle split cache (read) * Fix writing the data twice --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-09 06:50:53 +01:00
Djip007	808ce4907c	Unroll for loop for repacked BF16 MATMUL (#1047 ) see https://github.com/ikawrakow/ik_llama.cpp/discussions/1028 for detail	2025-12-08 06:09:45 +01:00
Kawrakow	c9fcfb9a7a	Fix annoying compiler warnings (#1042 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-06 09:59:07 +01:00
Kawrakow	87f6943e4b	Automatically disable CUDA graphs for split mode "graph" (#1040 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-06 07:38:02 +01:00
Kawrakow	a3737f4296	CUDA: set current device in compute_forward (#1039 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-05 16:47:50 +01:00
firecoperana	e741ec8a5d	CUDA: Fix FA for Pascal GPU (#1036 ) Co-authored-by: firecoperana <firecoperana>	2025-12-05 16:42:14 +01:00
Kawrakow	f4def9b300	Don't split the output tensor (#1038 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-05 15:56:53 +01:00
Kawrakow	b43801a2d2	Fix debug build (#1037 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-05 14:06:22 +01:00
Kawrakow	b715342e82	K-cache Hadamard transforms (CUDA) (#1034 ) * Hadamard transforms for K-cache on CUDA * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-04 18:46:22 +01:00
Kawrakow	658ced0abd	Hadamard transforms for K-cache - CPU only (#1033 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-04 06:51:11 +01:00
Kawrakow	08961718f3	Allow empty splits (#1029 ) * Allow empty splits * Fix type, add additional asserts * Fix also output --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-03 13:52:41 +01:00
Kawrakow	bcb218102d	Use standard attention for Ministral3 (#1032 ) Required adding the "temperature scaling" to the standard attention implementation. But in this way split mode "graph" is automatically supported. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-03 13:43:31 +01:00
Kawrakow	74c56067b4	Fix bug in ggml_cuda_op_scale_tensor (#1031 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-03 11:32:19 +01:00
Kawrakow	fcc2df11df	Adding ministral3: this seems to work (#1030 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-03 11:01:21 +01:00
Kawrakow	40097e7e41	Slightly better graph split strategy (#1026 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-02 18:50:52 +01:00
Kawrakow	8e3041b263	POC: CUDA tensor parallel (MoE models) (#1022 ) * Remove most of split mode row * WIP * WIP: also allocate the KV cache using tensor split * WIP: it runs with wrong result But it also looks like the backend scheduler is not going to help: * It copies mask and input positions to GPU 0 * => RoPE ops must run on GPU 0 * => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its entire attn calculation * Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must wait for GPU 0 to finish its entire FFN calculation before it can start (as it needs to copy the result of rms_norm from GPU 0) * => Seems useless without writing a bespoke TP scheduling * WIP * This works, but it is slow * This is slightly better the graph is still not being computed in parallel. Why? Because the scheduler creates graph splits where the result of the computation on one GPU becomes an input for the other split. Hence, to trigger the computation on the second GPU one needs to wait for the computation on the first GPU to finish, even though the two can be done in parallel up to the synchronization point. So, all that is left to do is to trick the scheduler to create two splits that can be done in parallel, and then have a graph split where the results get combined. * Playing games with the scheduler This change tricks it into doing the right thing^TM. Still quite a bit slower than split mode layer for the 8B LlaMA model. But for the 70B LlaMA it now beats split mode layer for TG: 28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s. In comparison, split mode "row" in mainline gets 484 t/s PP and 19.3 t/s TG. * Fix attn split Granularity for Wq, Wo is not just head size, but head size * gqa_ratio. Else the Wk, Wv tensors end up not being a multiple of the head size when we divide the split determined by Wo with the gqa_ratio. * Show memory used per device * Make it work with partial offload but no tensor overrides yet, just ngl < num_layers. 
* Allow for f16 source in fused_rms_norm * This results in faster PP. Now PP is faster than split mode layer for L3-70B. * Rename split mode "row" to split mode "graph" * Leave FFN partial results as f16 * WIP GLM4.5 - runs with wrong results * WIP GLM4.5 - this works PP is already better than split mode layer, but TG for zero context is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer at around 20k tokens. PP at 26k tokens is 1.55X of sm layer. * Work around compiler bug It issues a warning that there is an extra semicolon outside of a function, but there isn't. If I remove the anonymous namespace and turn the functions inside into static, the warning disappears, so clearly a compiler bug. * Make graph reuse work with split mode graph * Remove more split mode row remnants * WIP tensor overrides Runs with wrong results, don't see where the issue could be. * This works but is slow Still does not work for row-interleaved quants * Slightly better * Slightly better * Row-interleaved quants work * Better * Minor * Guard against using split mode "graph" for unsupported models * Guards against using merge_qkv with split mode "graph" * WIP split mode attn Works for LlaMA models, but not for GLM-4.5. Doesn't seem to improve performance, so I guess no point in trying to fix it. * Split mode graph for qwen3moe * Try to better distribute the splits --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-01 19:25:40 +01:00
Kawrakow	507f3a4d14	Fix build with RPC not enabled (#1025 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-30 19:04:54 +01:00


4083 Commits