ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 17:20:01 +00:00

Author	SHA1	Message	Date
Kawrakow	996e77047a	Avoid ggml_get_rows if not necessary (#1160 ) * Copy reduce result to other GPUs if necessary * Avoid ggml_get_rows for TG * For the output ops use the result of the split that ran on the main GPU * More models	2026-01-20 15:38:21 +02:00
Kawrakow	132a01d25d	GLM-4.7-Flash support (#1168 ) * GLM-4.7-Flash support * Model type * Make FA work for mla != 0	2026-01-20 12:46:52 +02:00
Kawrakow	ef5f17940c	sampling: refactor sorting (#1166 ) * sampling: refactor sorting * Couldn't look at it without fixing it.	2026-01-19 16:48:54 +02:00
Kawrakow	98b30e5e81	Faster adaptive_p sampling (#1165 ) * A hopefully more efficient adaptive_p sampling * Once at it, lets fix the formatting too * More formatting * Hopefully better * This should be better * Correctly accumulate adaptive_p sampling time * AVX2	2026-01-19 16:03:09 +02:00
Kawrakow	fa58c20c42	A hopefully more efficient adaptive_p sampling (#1161 ) * A hopefully more efficient adaptive_p sampling * Once at it, lets fix the formatting too * More formatting * Correctly accumulate sampling time for adaptive_p	2026-01-19 15:01:55 +02:00
Kawrakow	0c0b6e4b8b	Copy reduce result to other GPUs if necessary (#1156 )	2026-01-19 08:40:26 +02:00
dungquixote42	6dfbef27ec	Adaptive p: bugfix + optimization + refactor (#1155 ) * adaptive-p sampler: fix zeroed orig_probs bug and refactor - Fix bug where original probabilities were captured as zero by calculating them from logits in llama_prep_adaptive_p (new). - Replace vector with unordered_map to track candidate probabilities, filtering for relevance via logit delta (16.6f). - Standardize API naming: llama_<action/verb>_<focus/name/topic>_<extra/info> - Update function signatures to follow most other samplers. * resolve merge bug * adaptive-p: revert reordering function definitions	2026-01-18 08:26:06 +02:00
firecoperana	d71a3ec315	Server: refactor and rename functions (#1151 ) * Server: rename functions and refactor code rename functions refactor update slots rename params_base rename timings * change * Revert kv cache name changes * Revert 2 * fix test build error --------- Co-authored-by: firecoperana <firecoperana>	2026-01-18 08:16:57 +02:00
Kawrakow	7024fdbc72	Additional graph reduce types for split mode graph (#1154) * WIP: add Q8_0 and BF16 as possible reduce types Does not work - there is a bug somewhere * This finally works	2026-01-18 08:02:49 +02:00
Kawrakow	709e1a5375	Fixing split mode graph with many GPUs (#1152 ) * Attempt to fix the many GPU issue in split mode graph * WIP: this seems more stable Still hanging after a while if I try to use all 7 GPUs * Reenable OpenMP in scheduler async Seems solid up to 4 GPUs. It did hang with --max-gpu 6. * printf cleanup	2026-01-17 08:05:24 +02:00
Kawrakow	cb1063f6cd	Fix experts/shared experts split (#1147 )	2026-01-14 15:35:16 +02:00
Kawrakow	978202a754	Merge ffn_up and ffn_gate experts tensors (part 2) (#1139 ) * Add ability to merge up+gate exps to more models * We need to of course pass the merged tensor to build_ffn * All the others * Also Qwen3VL-MoE --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-13 08:07:52 +02:00
firecoperana	1a461525d5	server: stop processing the prompt when client disconnects (#1134 ) implement generator-based API for task results Update httplib.h to 0.27.0 Fix embedding error Stop prompt processing when disconnected Co-authored-by: firecoperana <firecoperana>	2026-01-13 07:56:59 +02:00
Kawrakow	c03c2d7cc6	Merge ffn_up and ffn_gate experts tensors (#1137 ) * WIP - not working * WIP - not working * WIP - GPT-OSS working However, extremely stupid. The only way I could correctly repack the up/gate experts is to copy up and gate into host buffers, repack into another host buffer, copy back into the ffn_up_gate_exps tensor. This is going to be very slow for giant 500 GB models. My attempts to do this via a compute graph on the backend holding the tensors was unsuccessful. For GPT-OSS-20B I see ~6-7% better PP when using the original ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when using the small batch size implementation. Other models are not working yet on CUDA as I need to fix the fused mul-unary implementation. * WIP * WIP - Qwen3-MoE (and hopefully all others) working But when I say here and in the previous commit "working", I mean PP is working. TG is still broken. * WIP: TG seems to be working * Minor * Add command line option to merge experts up/gate * Add merge up/gate command line parameter to llama-bench * Turn off merge_up_gate_exps if split mode graph It is not yet implemented * When no bias, allow merging up/gate with tensor overrides * Arghh, we need to increase the context size again * Cleanup	2026-01-12 18:30:53 +02:00
Kawrakow	738dc60b78	We don't need these	2026-01-10 15:32:21 +00:00
dungquixote42	52ad1c6421	Implement Adaptive-P Sampler (#1100) * initial implementation of adaptive-p sampler * explicitly mark candidates unsorted + cleanup qualifiers * cosmetic update * reorg prototypes * lockstep with mainline * add _impl for _init + reorg * add LLAMA_API to prototypes * update sharpness to 10 * lockstep: rng seed * delete llama_sampling member in llama_sampler_adaptive_p * fix LLAMA_API return type * lockstep: rng seed cont * actually correct implementation * lockstep: sorting behavior * const -> constexpr for known constants * add missing space * fix softmax usage in adaptive p sampler * cosmetic changes * implement do-not-sort version of softmax * simplify rng seed, add static to constexpr * refactor: remove iface + use shared rng + use actually original probabilities * adaptive-p: add dedicated rng back in * fix initial max_logit + add float vector to adaptive p sampler context + stochastic sampling * adaptive-p: fuse first softmax with transformation * adaptive-p: implement binary search selection * adaptive-p: update comment	2026-01-10 07:58:53 +02:00
Kawrakow	dd3c3f72f2	Fix split mode graph for GPT-OSS with partial offload (#1128 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-10 07:57:43 +02:00
Kawrakow	08a0da389c	Better VRAM utilization strategy for split mode graph (#1126 ) * Better VRAM utilization strategy for split mode graph * Fix assert when --max-gpu is less than available GPUs --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-09 13:36:02 +02:00
Kawrakow	eaf2e1c15a	Split mode "graph" for Ernie-4.5-MoE (#1121 ) * Ernie-4.5-MoE split mode graph * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-08 16:46:41 +02:00
Kawrakow	5ef98f8b0f	Split mode "graph" for GPT-OSS (#1118 ) * Split mode "graph" for GPT-OSS * Force split_mode_f16 to false --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-08 09:14:15 +02:00
Kawrakow	99fbd84971	Split mode "graph" for Hunyuan-MoE (#1116 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-07 13:38:08 +02:00
Kawrakow	ab1616767b	Enable up to 4 GPUs for Mimo2-Flash (#1115 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-07 09:40:29 +02:00
Kawrakow	3c99284b67	Split mode 'graph' for Qwen3-VL (#1107) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-05 17:32:00 +02:00
Kawrakow	218dcc5727	Split mode graph for Qwen3 (#1106 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-05 14:31:36 +02:00
Kawrakow	419a397ce0	Graph parallel for Mimo-V2-Flash (#1105 ) * WIP * Cleanup * Set max_gpu to 2 for Mimo2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-05 09:58:54 +02:00
Kawrakow	ab50c6cdcb	Mimo-V2-Flash support (#1096 ) * Mimo-2 support * Fix bug for head sizes not being the same It still does not solve the Mimo-2 quantized cache issue. * Fix quantized cache * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-05 08:00:01 +02:00
firecoperana	56dceefd6b	Fix windows build with CUDA (#1101 ) Co-authored-by: firecoperana <firecoperana>	2026-01-05 07:59:23 +02:00
Kawrakow	f878adbe90	Turn on graph reuse by default (#1094 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-27 08:27:16 +01:00
Kawrakow	519405dc97	Async compute graph evaluation (2 or more GPUs) (#1089) * WIP: absorb adding input into std_attn and std_ffn * WIP: NCCL infra * WIP: add reduce and fake_cpy ops * WIP * WIP: graph appears to work, layer is broken * WIP: Qwen3-MoE works with graph, layer still broken * WIP: GLM-4.5 graph works * WIP: fix sm layer (dense) * WIP: fix sm layer (MoE) * WIP: fast PP with bespoke 4-GPU NCCL I guess, I'm not using NCCL the right way as PP is very low with a single communicator group for 3 or more GPUs. But if I create 4 communicator groups for pairs of GPUs (0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting 1500 t/s for L3-70B on the 4x3090 system, which is ~20% better than the previous sm graph without NCCL. But that cannot be the solution (I cannot be creating pairwise communicators and associated logic for every possible number of GPUs). * WIP: Cohere2 * Explicitly set device * Bespoke 3-GPU case * WIP * Do not repeat get_rows multiple times * Fix 3 GPUs * OK, let's leave it in * Simple async * This sync seems enough * Only do async for 4 or more backends With 2 GPUs (so, 3 backends) not using async is slightly faster * Scheduler changes * Use OpenMP if available Surprisingly (at least to me), this is quite a bit faster than std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now at 105 t/s at zero context! * Do not use OpenMP if there are tensor overrides * Set omp max active levels * Be more careful with having set the device before using a stream * Command line option to turn on async. Set to false by default for now --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-27 08:18:06 +01:00
Kawrakow	ada5cc1523	Fused norm (#1086 ) * Adding fused_norm - same idea as fused_rms_norm * Avoid computing the attention reduce op for cohere2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-24 15:22:43 +01:00
Kawrakow	5e64235d4c	Be able to set reduce op data type for split mode "graph" (#1087 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-24 14:01:29 +01:00
Kawrakow	0d7eb34185	Graph parallel: the next generation (#1080) * WIP: absorb adding input into std_attn and std_ffn * WIP: NCCL infra * WIP: add reduce and fake_cpy ops * WIP * WIP: graph appears to work, layer is broken * WIP: Qwen3-MoE works with graph, layer still broken * WIP: GLM-4.5 graph works * WIP: fix sm layer (dense) * WIP: fix sm layer (MoE) * WIP: fast PP with bespoke 4-GPU NCCL I guess, I'm not using NCCL the right way as PP is very low with a single communicator group for 3 or more GPUs. But if I create 4 communicator groups for pairs of GPUs (0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting 1500 t/s for L3-70B on the 4x3090 system, which is ~20% better than the previous sm graph without NCCL. But that cannot be the solution (I cannot be creating pairwise communicators and associated logic for every possible number of GPUs). * WIP: Cohere2 * Explicitly set device * Bespoke 3-GPU case * WIP * Do not repeat get_rows multiple times * Fix 3 GPUs * OK, let's leave it in * Implement the reduce op without NCCL available * Be able to build without NCCL cmake -DGGML_NCCL=OFF disables it * Make --max-gpu work again * Slightly better for 4 GPUs without NCCL * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-24 08:31:48 +01:00
Nexes the Elder	d1dd45b4b9	add split-mode-graph-scheduling parameter (#1068 ) Use -smgs or --split-mode-graph-scheduling in CLI to bypass the disabling of split mode graph scheduling when tensor overrides is used. Co-authored-by: Kawrakow <iwankawrakow@gmail.com>	2025-12-17 07:58:19 +01:00
Kawrakow	5585ac2aa8	Better PP performance with split mode "graph" and 3+ GPUs (#1069 ) * This should do the trick for PP * Command line option to set max. extra VRAM that the scheduler can use * Fix bug and cleanup * Looks like with this change it is working with tensor overrides * Nah, it is not working * OK, this seems to be working * Disable split scheduling with tensor overrides --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-17 07:40:25 +01:00
Kawrakow	7b03c9dcef	Use actual active number of layers when preparing splits (#1065 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-14 07:44:13 +01:00
Kawrakow	d97a6de34d	Split mode "graph" for Cohere2 (#1061) * This works and TG is decent, but PP is low * Better * Apply f_logit_scale before mul mat with output tensor * This is better for PP: 600 t/s -> 700 t/s * To not lose this again * WIP * Equal split * WIP --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-13 20:30:08 +01:00
Kawrakow	093cc7c380	Do not use split mode graph scheduling if there are tensor overrides (#1060 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-12 14:48:38 +01:00
Kawrakow	f65fefa36c	Slightly faster TG for split mode "graph" (#1057 ) * Rearrange graph nodes So that we can do graph portions that are the same on 2 or more GPUs at the same time. * Separate graph compute implementation for split mode graph * This is better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-12 07:54:37 +01:00
Kawrakow	bf03f63c34	Fix #1055 (#1056 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 14:44:32 +01:00
Kawrakow	22863cf9c9	Be able to set a max. number of GPUs to be used in split mode graph (#1051 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-11 07:22:53 +01:00
Kawrakow	5fe3979951	KV cache read/write for split mode "graph" (#1048 ) * Handle split cache (write) * Handle split cache (read) * Fix writing the data twice --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-09 06:50:53 +01:00
Kawrakow	e02b71f89e	Automatically disable CUDA graphs for split mode "graph" (#1040 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-06 07:38:02 +01:00
Kawrakow	2125f68636	Don't split the output tensor (#1038 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-05 15:56:53 +01:00
Kawrakow	18fdd80eaf	Hadamard transforms for K-cache - CPU only (#1033 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-04 06:51:11 +01:00
Kawrakow	0581f90c0f	Allow empty splits (#1029 ) * Allow empty splits * Fix type, add additional asserts * Fix also output --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-03 13:52:41 +01:00
Kawrakow	90f36eb517	Use standard attention for Ministral3 (#1032 ) Required adding the "temperature scaling" to the standard attention implementation. But in this way split mode "graph" is automatically supported. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-03 13:43:31 +01:00
Kawrakow	cf20d0c756	Adding ministral3: this seems to work (#1030 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-03 11:01:21 +01:00
Kawrakow	92410bbd1e	Slightly better graph split strategy (#1026 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-02 18:50:52 +01:00
Kawrakow	a719349982	POC: CUDA tensor parallel (MoE models) (#1022) * Remove most of split mode row * WIP * WIP: also allocate the KV cache using tensor split * WIP: it runs with wrong result But it also looks like the backend scheduler is not going to help: * It copies mask and input positions to GPU 0 * => RoPE ops must run on GPU 0 * => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its entire attn calculation * Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must wait for GPU 0 to finish its entire FFN calculation before it can start (as it needs to copy the result of rms_norm from GPU 0) * => Seems useless without writing a bespoke TP scheduling * WIP * This works, but it is slow * This is slightly better the graph is still not being computed in parallel. Why? Because the scheduler creates graph splits where the result of the computation on one GPU becomes an input for the other split. Hence, to trigger the computation on the second GPU one needs to wait for the computation on the first GPU to finish, even though the two can be done in parallel up to the synchronization point. So, all that is left to do is to trick the scheduler to create two splits that can be done in parallel, and then have a graph split where the results get combined. * Playing games with the scheduler This change tricks it into doing the right thing^TM. Still quite a bit slower than split mode layer for the 8B LlaMA model. But for the 70B LlaMA it now beats split mode layer for TG: 28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s. In comparison, split mode "row" in mainline gets 484 t/s PP and 19.3 t/s TG. * Fix attn split Granularity for Wq, Wo is not just head size, but head size * gqa_ratio. Else the Wk, Wv tensors end up not being a multiple of the head size when we divide the split determined by Wo with the gqa_ratio. * Show memory used per device * Make it work with partial offload but no tensor overrides yet, just ngl < num_layers. * Allow for f16 source in fused_rms_norm * This results in faster PP. Now PP is faster than split mode layer for L3-70B. * Rename split mode "row" to split mode "graph" * Leave FFN partial results as f16 * WIP GLM4.5 - runs with wrong results * WIP GLM4.5 - this works PP is already better than split mode layer, but TG for zero context is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer at around 20k tokens. PP at 26k tokens is 1.55X of sm layer. * Work around compiler bug It issues a warning that there is an extra semicolon outside of a function, but there isn't. If I remove the anonymous namespace and turn the functions inside into static, the warning disappears, so clearly a compiler bug. * Make graph reuse work with split mode graph * Remove more split mode row remnants * WIP tensor overrides Runs with wrong results, don't see where the issue could be. * This works but is slow Still does not work for row-interleaved quants * Slightly better * Slightly better * Row-interleaved quants work * Better * Minor * Guard against using split mode "graph" for unsupported models * Guards against using merge_qkv with split mode "graph" * WIP split mode attn Works for LlaMA models, but not for GLM-4.5. Doesn't seem to improve performance, so I guess no point in trying to fix it. * Split mode graph for qwen3moe * Try to better distribute the splits --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-01 19:25:40 +01:00
firecoperana	15771072c7	RPC: support multiple devices including cpu (#1024) * RPC support multiple devices * rpc : update documentation (#16441) Update the README file to match the newly added functionality of exposing multiple devices from a single server. Co-authored-by: Diego Devesa <slarengh@gmail.com> # Conflicts: # examples/rpc/README.md * Remove memory settings * rpc : cache and reuse compute graphs (#15405) Store the last computed graph and reuse it when possible. Also do not return response from GRAPH_COMPUTE and assume it always completes successfully. If this is not the case, the server closes the connection. This saves us a network round trip to the server. * Add -cpu to include cpu backend --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>	2025-11-30 18:48:02 +01:00

295 Commits