Iwan Kawrakow
1dd9bf7bcb
Do not repeat get_rows multiple times
2025-12-22 13:57:57 +00:00
Iwan Kawrakow
f7cd271cad
WIP
2025-12-22 13:36:58 +00:00
Iwan Kawrakow
12c8d3c650
Bespoke 3-GPU case
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
0af67af9a5
Explicitly set device
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
aa3f14b963
WIP: Cohere2
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
d50ef0165e
WIP: fast PP with bespoke 4-GPU NCCL
...
I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
((0,1), (2,3), (0,2), (1,3)) and use those, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot keep creating pairwise
communicators and the associated logic for every possible number of GPUs).
2025-12-22 11:16:24 +00:00
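The pair groups in the commit above ((0,1), (2,3), (0,2), (1,3)) are the 4-GPU instance of a recursive-doubling (hypercube) schedule, which does generalize to any power-of-two device count. A minimal sketch of generating the per-stage pair groups — illustrative only, not code from this repository, and `pairwise_groups` is a hypothetical helper:

```python
def pairwise_groups(n_gpus: int):
    """Per-stage pair groups for a recursive-doubling exchange over
    n_gpus devices (power of two). Stage d pairs device i with i XOR 2**d."""
    assert n_gpus & (n_gpus - 1) == 0, "power-of-two GPU count assumed"
    stages = []
    step = 1
    while step < n_gpus:
        # Each stage partners every device with the one 'step' apart.
        stages.append([(i, i | step) for i in range(n_gpus) if not i & step])
        step *= 2
    return stages

# For 4 GPUs this reproduces the groups from the commit message:
# stage 0: (0,1), (2,3); stage 1: (0,2), (1,3)
print(pairwise_groups(4))
```

Whether NCCL communicators built from generated groups like these would keep the fast PP path is exactly the open question raised in the commit message.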
Iwan Kawrakow
297f82ed02
WIP: fix sm layer (MoE)
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
5db8262d94
WIP: fix sm layer (dense)
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
1fe53d2002
WIP: GLM-4.5 graph works
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
77bf735d10
WIP: Qwen3-MoE works with graph, layer still broken
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
2b44a0d946
WIP: graph appears to work, layer is broken
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
72fed6daaa
WIP
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
5e86e81a2d
WIP: add reduce and fake_cpy ops
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
655f6ce301
WIP: NCCL infra
2025-12-22 11:16:24 +00:00
Iwan Kawrakow
e2f325fad3
WIP: absorb adding input into std_attn and std_ffn
2025-12-22 11:16:24 +00:00
firecoperana
5562605076
server: exclude thinking tokens when finding the slot (#1079)
...
refactor find slot
enable by default
Fix load prompt
rename variables
Co-authored-by: firecoperana <firecoperana>
2025-12-22 09:46:45 +01:00
Kawrakow
21fc9322f9
cuda: set device to src device before p2p copy (#1073)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 12:50:34 +01:00
Nexes the Elder
7bb79eff48
add split-mode-graph-scheduling parameter (#1068)
...
Use -smgs or --split-mode-graph-scheduling in CLI to bypass the disabling of split mode graph scheduling when tensor overrides are used.
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2025-12-17 07:58:19 +01:00
Kawrakow
51eea5715f
Better PP performance with split mode "graph" and 3+ GPUs (#1069)
...
* This should do the trick for PP
* Command line option to set max. extra VRAM that the scheduler can use
* Fix bug and cleanup
* Looks like with this change it is working with tensor overrides
* Nah, it is not working
* OK, this seems to be working
* Disable split scheduling with tensor overrides
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 07:40:25 +01:00
Kawrakow
8ccceff4e9
Much better TG speed with split mode "graph" (#1067)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-16 19:48:20 +01:00
firecoperana
756c3f8f43
Fix log issue for llama-cli (#1071)
...
Co-authored-by: firecoperana <firecoperana>
2025-12-16 18:12:16 +01:00
firecoperana
269cc761db
Add back the fix for Kimi-K2 tool-call parsing issues (#1070)
...
Co-authored-by: firecoperana <firecoperana>
2025-12-16 14:44:47 +01:00
firecoperana
090f354d33
Refactor chat and server file (#1062)
...
* Add alternative log functions
* chat: fix int overflow, prevent size calculation in float/double (#17357)
* chat: fix int overflow, prevent size calculation in float/double
* Update common/chat.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* common : move all common_chat_parse_* to chat-parser.cpp. (#17481)
# Conflicts:
# common/chat.cpp
* server: split server.cpp code into server/common/task/queue/context
* Fix compiler warning
* Clean up code
* common: use native MultiByteToWideChar
* move server prompt to server task
* Clean code
* delete utils.hpp
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
2025-12-15 08:27:20 +01:00
Kawrakow
0a36cea555
Use actual active number of layers when preparing splits (#1065)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-14 07:44:13 +01:00
Kawrakow
f90d1fdd06
Split mode "graph" for Cohere2 (#1061)
...
* This works and TG is decent, but PP is low
* Better
* Apply f_logit_scale before mul mat with output tensor
* This is better for PP: 600 t/s -> 700 t/s
* To not lose this again
* WIP
* Equal split
* WIP
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 20:30:08 +01:00
Kawrakow
5645be6cfc
Fix sync logic (#1064)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 18:40:49 +01:00
Kawrakow
f667bd58b0
Undo sync reduction (#1063)
...
I'm finding issues for Qwen3-MoE
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 16:58:32 +01:00
Kawrakow
df02c39650
Do not use split mode graph scheduling if there are tensor overrides (#1060)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 14:48:38 +01:00
Kawrakow
cc14d4a3cc
Fix overflow in offset calculation in mmq (#1059)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 14:31:06 +01:00
Kawrakow
b74fb479af
Be able to enable or disable P2P via command line argument (#1058)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 13:36:42 +01:00
Kawrakow
0698501ae2
Slightly faster TG for split mode "graph" (#1057)
...
* Rearrange graph nodes
So that we can do graph portions that are the same on 2 or more GPUs at the same time.
* Separate graph compute implementation for split mode graph
* This is better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 07:54:37 +01:00
Kawrakow
6a0e72aeae
Fix #1055 (#1056)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 14:44:32 +01:00
abc-nix
0feb046e6b
enable peer access (NVLink) (#1050)
...
* enable peer access for cuda
* Remove redundant loop
2025-12-11 08:31:56 +01:00
Kawrakow
59dba9f778
Fix the fix (#1054)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 08:05:33 +01:00
Kawrakow
9484d150d8
Be able to set a max. number of GPUs to be used in split mode graph (#1051)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:22:53 +01:00
Kawrakow
6a5a707ac0
Fix llama-bench - missing buffer override comparison operator (#1053)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:21:06 +01:00
Kawrakow
00d939c811
Reduce back-end syncs (#1049)
...
* Reduce backend synchronization calls
* Also this
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:04:44 +01:00
i4TsU
62f907c663
QoL/bugfixes for llama-bench (#1052)
...
* include cuda-params and -ot in llama-bench output
* cleanup redundant type mapping
* fix wrong field name
* fix preexisting mistake in cuda_params help text (default value)
* fix preexisting mistake in kompute column header
* adjust code style to match current norms
* simplify/fix inverted columns
* fix field->value pairings/order
* remove dead field `f16_kv`
* sql printer deserves a way out too
* actually enable the new improvements....
2025-12-11 07:04:15 +01:00
Kawrakow
53f693a708
KV cache read/write for split mode "graph" (#1048)
...
* Handle split cache (write)
* Handle split cache (read)
* Fix writing the data twice
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-09 06:50:53 +01:00
Djip007
808ce4907c
Unroll for loop for repacked BF16 MATMUL (#1047)
...
See https://github.com/ikawrakow/ik_llama.cpp/discussions/1028 for details.
2025-12-08 06:09:45 +01:00
Kawrakow
c9fcfb9a7a
Fix annoying compiler warnings (#1042)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-06 09:59:07 +01:00
Kawrakow
87f6943e4b
Automatically disable CUDA graphs for split mode "graph" (#1040)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-06 07:38:02 +01:00
Kawrakow
a3737f4296
CUDA: set current device in compute_forward (#1039)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 16:47:50 +01:00
firecoperana
e741ec8a5d
CUDA: Fix FA for Pascal GPU (#1036)
...
Co-authored-by: firecoperana <firecoperana>
2025-12-05 16:42:14 +01:00
Kawrakow
f4def9b300
Don't split the output tensor (#1038)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 15:56:53 +01:00
Kawrakow
b43801a2d2
Fix debug build (#1037)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 14:06:22 +01:00
Kawrakow
b715342e82
K-cache Hadamard transforms (CUDA) (#1034)
...
* Hadamard transforms for K-cache on CUDA
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-04 18:46:22 +01:00
Kawrakow
658ced0abd
Hadamard transforms for K-cache - CPU only (#1033)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-04 06:51:11 +01:00
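The two commits above introduce Hadamard transforms for the K-cache on CUDA and CPU. As an illustrative sketch only, not the repository's implementation, an unnormalized Hadamard transform of a power-of-two-length vector can be computed in place with the fast Walsh-Hadamard recursion:

```python
def fwht(a: list) -> list:
    """In-place fast Walsh-Hadamard transform; len(a) must be a power of two.
    Applies the unnormalized Hadamard matrix H_n to the vector a."""
    n = len(a)
    assert n & (n - 1) == 0, "power-of-two length assumed"
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y  # butterfly step
        h *= 2
    return a

# H_4 applied to a unit vector spreads it into a row of +/-1 values:
print(fwht([1, 0, 0, 0]))  # -> [1, 1, 1, 1]
```

Dividing the result by sqrt(n) gives the orthonormal variant; such rotations are commonly used to spread outliers more evenly across a vector before quantizing it, which is the usual motivation for applying them to a KV cache.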
Kawrakow
08961718f3
Allow empty splits (#1029)
...
* Allow empty splits
* Fix type, add additional asserts
* Fix also output
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-03 13:52:41 +01:00
Kawrakow
bcb218102d
Use standard attention for Ministral3 (#1032)
...
This required adding the "temperature scaling" to the standard attention implementation, but in this way split mode "graph" is automatically supported.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-03 13:43:31 +01:00