ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 17:20:01 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	29d323117c	Command line option to turn on async. Set to false by default for now	2025-12-27 06:24:01 +00:00
Iwan Kawrakow	07759f172c	Be more careful with having set the device before using a stream	2025-12-26 18:17:16 +00:00
Iwan Kawrakow	b79bf6c0ef	Merge remote-tracking branch 'origin/main' into ik/nccl3_async	2025-12-26 16:36:25 +00:00
Kawrakow	59d0022991	Graph parallel: better PP performance for 3 and more GPUs (#1092 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-26 17:35:27 +01:00
Iwan Kawrakow	443445579f	Set omp max active levels	2025-12-26 05:09:27 +00:00
Iwan Kawrakow	072cd216f4	Do not use OpenMP if there are tensor overrides	2025-12-25 17:06:46 +00:00
Iwan Kawrakow	197de25020	Use OpenMP if available. Surprisingly (at least to me), this is quite a bit faster than std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now at 105 t/s at zero context!	2025-12-25 15:20:37 +00:00
Iwan Kawrakow	4707b09137	Merge remote-tracking branch 'origin/main' into ik/nccl3_async	2025-12-25 07:57:23 +00:00
Kawrakow	03ed5f7096	Fix split mode graph when p2p is not enabled (#1091 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-25 08:55:08 +01:00
Kawrakow	41a8d05420	Reduce add improvements without NCCL (#1088 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-25 08:44:24 +01:00
Iwan Kawrakow	6803cad2f3	Scheduler changes	2025-12-25 07:18:51 +00:00
Iwan Kawrakow	930c9f7006	Only do async for 4 or more backends. With 2 GPUs (so, 3 backends) not using async is slightly faster	2025-12-24 16:15:50 +00:00
Iwan Kawrakow	16d0dd794c	Merge remote-tracking branch 'origin/main' into ik/nccl3_async	2025-12-24 15:28:13 +00:00
Kawrakow	fbb67fa2bd	Fused norm (#1086 ) * Adding fused_norm - same idea as fused_rms_norm * Avoid computing the attention reduce op for cohere2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-24 15:22:43 +01:00
Kawrakow	1ace5b7526	Be able to set reduce op data type for split mode "graph" (#1087 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-24 14:01:29 +01:00
firecoperana	2421a7e12b	Webui: improve scroll and bug fixes (#1082 ) * Webui: fix message scroll back due to setPending smooth scroll remove throttle increase scroll margin # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html # examples/server/webui/src/utils/app.context.tsx * webui: don't scroll to bottom when conversation changes or edit message # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html * Webui: fix save config error # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html * Webui: add api key to request model name # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html * Update * webui: fix loading dots display issue # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html # examples/server/webui/src/components/ChatMessage.tsx * Webui: cancel scroll when user moves up --------- Co-authored-by: firecoperana <firecoperana>	2025-12-24 12:30:26 +01:00
Kawrakow	1d7d0225a0	Graph parallel: the next generation (#1080 ) * WIP: absorb adding input into std_attn and std_ffn * WIP: NCCL infra * WIP: add reduce and fake_cpy ops * WIP * WIP: graph appears to work, layer is broken * WIP: Qwen3-MoE works with graph, layer still broken * WIP: GLM-4.5 graph works * WIP: fix sm layer (dense) * WIP: fix sm layer (MoE) * WIP: fast PP with bespoke 4-GPU NCCL I guess, I'm not using NCCL the right way as PP is very low with a single communicator group for 3 or more GPUs. But if I create 4 communicator groups for pairs of GPUs (0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting 1500 t/s for L3-70B on the 4x3090 system, which is ~20% better than the previous sm graph without NCCL. But that cannot be the solution (I cannot be creating pairwise communicators and associated logic for every possible number of GPUs). * WIP: Cohere2 * Explicitly set device * Bespoke 3-GPU case * WIP * Do not repeat get_rows multiple times * Fix 3 GPUs * OK, let's leave it in * Implement the reduce op without NCCL available * Be able to build without NCCL cmake -DGGML_NCCL=OFF disables it * Make --max-gpu work again * Slightly better for 4 GPUs without NCCL * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-24 08:31:48 +01:00
Iwan Kawrakow	ef30dd8834	This sync seems enough	2025-12-23 05:17:22 +00:00
Iwan Kawrakow	dc28cadb65	Simple async	2025-12-22 18:43:13 +00:00
Iwan Kawrakow	d4c23f1f89	OK, let's leave it in	2025-12-22 17:13:23 +00:00
Iwan Kawrakow	526ce7e050	Fix 3 GPUs	2025-12-22 16:43:42 +00:00
Iwan Kawrakow	1dd9bf7bcb	Do not repeat get_rows multiple times	2025-12-22 13:57:57 +00:00
Iwan Kawrakow	f7cd271cad	WIP	2025-12-22 13:36:58 +00:00
Iwan Kawrakow	12c8d3c650	Bespoke 3-GPU case	2025-12-22 11:16:24 +00:00
Iwan Kawrakow	0af67af9a5	Explicitly set device	2025-12-22 11:16:24 +00:00
Iwan Kawrakow	aa3f14b963	WIP: Cohere2	2025-12-22 11:16:24 +00:00
Iwan Kawrakow	d50ef0165e	WIP: fast PP with bespoke 4-GPU NCCL I guess, I'm not using NCCL the right way as PP is very low with a single communicator group for 3 or more GPUs. But if I create 4 communicator groups for pairs of GPUs (0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting 1500 t/s for L3-70B on the 4x3090 system, which is ~20% better than the previous sm graph without NCCL. But that cannot be the solution (I cannot be creating pairwise communicators and associated logic for every possible number of GPUs).	2025-12-22 11:16:24 +00:00
Iwan Kawrakow	297f82ed02	WIP: fix sm layer (MoE)	2025-12-22 11:16:24 +00:00
Iwan Kawrakow	5db8262d94	WIP: fix sm layer (dense)	2025-12-22 11:16:24 +00:00
Iwan Kawrakow	1fe53d2002	WIP: GLM-4.5 graph works	2025-12-22 11:16:24 +00:00
Iwan Kawrakow	77bf735d10	WIP: Qwen3-MoE works with graph, layer still broken	2025-12-22 11:16:24 +00:00
Iwan Kawrakow	2b44a0d946	WIP: graph appears to work, layer is broken	2025-12-22 11:16:24 +00:00
Iwan Kawrakow	72fed6daaa	WIP	2025-12-22 11:16:24 +00:00
Iwan Kawrakow	5e86e81a2d	WIP: add reduce and fake_cpy ops	2025-12-22 11:16:24 +00:00
Iwan Kawrakow	655f6ce301	WIP: NCCL infra	2025-12-22 11:16:24 +00:00
Iwan Kawrakow	e2f325fad3	WIP: absorb adding input into std_attn and std_ffn	2025-12-22 11:16:24 +00:00
firecoperana	5562605076	server: exclude thinking tokens when finding the slot (#1079 ) refactor find slot enable by default Fix load prompt rename variables Co-authored-by: firecoperana <firecoperana>	2025-12-22 09:46:45 +01:00
Kawrakow	21fc9322f9	cuda: set device to src device before p2p copy (#1073 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-17 12:50:34 +01:00
Nexes the Elder	7bb79eff48	add split-mode-graph-scheduling parameter (#1068 ) Use -smgs or --split-mode-graph-scheduling in CLI to bypass the disabling of split mode graph scheduling when tensor overrides is used. Co-authored-by: Kawrakow <iwankawrakow@gmail.com>	2025-12-17 07:58:19 +01:00
Kawrakow	51eea5715f	Better PP performance with split mode "graph" and 3+ GPUs (#1069 ) * This should do the trick for PP * Command line option to set max. extra VRAM that the scheduler can use * Fix bug and cleanup * Looks like with this change it is working with tensor overrides * Nah, it is not working * OK, this seems to be working * Disable split scheduling with tensor overrides --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-17 07:40:25 +01:00
Kawrakow	8ccceff4e9	Much better TG speed with split mode "graph" (#1067 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-16 19:48:20 +01:00
firecoperana	756c3f8f43	Fix log issue for llama-cli (#1071 ) Co-authored-by: firecoperana <firecoperana>	2025-12-16 18:12:16 +01:00
firecoperana	269cc761db	Add back the fix for Kimi-K2 tool-call parsing issues (#1070 ) Co-authored-by: firecoperana <firecoperana>	2025-12-16 14:44:47 +01:00
firecoperana	090f354d33	Refactor chat and server file (#1062 ) * Add alternative log functions * chat: fix int overflow, prevent size calculation in float/double (#17357) * chat: fix int overflow, prevent size calculation in float/double * Update common/chat.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * common : move all common_chat_parse_* to chat-parser.cpp. (#17481) # Conflicts: # common/chat.cpp * server: split server.cpp code into server/common/task/queue/context * Fix compiler warning * Clean up code * common: use native MultiByteToWideChar * move server prompt to server task * Clean code * delete utils.hpp --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: DAN™ <dranger003@gmail.com>	2025-12-15 08:27:20 +01:00
Kawrakow	0a36cea555	Use actual active number of layers when preparing splits (#1065 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-14 07:44:13 +01:00
Kawrakow	f90d1fdd06	Split mode "graph" for Cohere2 (#1061 ) * This works and TG is decent, but PP is low * Better * Apply f_logit_scale before mul mat with output tensor * This is better for PP: 600 t/s -> 700 t/s * To not lose this again * WIP * Equal split * WIP --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-13 20:30:08 +01:00
Kawrakow	5645be6cfc	Fix sync logic (#1064 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-13 18:40:49 +01:00
Kawrakow	f667bd58b0	Undo sync reduction (#1063 ) I'm finding issues for Qwen3-MoE Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-13 16:58:32 +01:00
Kawrakow	df02c39650	Do not use split mode graph scheduling if there are tensor overrides (#1060 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-12 14:48:38 +01:00
Kawrakow	cc14d4a3cc	Fix overflow in offset calculation in mmq (#1059 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-12 14:31:06 +01:00


4109 Commits