ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-13 07:20:15 +00:00

Author	SHA1	Message	Date
firecoperana	c2eed98296	update description	2026-01-16 18:52:52 -06:00
firecoperana	7b4e848bf2	Webui: add text completions and adaptive_p sampling	2026-01-16 17:43:36 -06:00
Kawrakow	cb1063f6cd	Fix experts/shared experts split (#1147 )	2026-01-14 15:35:16 +02:00
hksdpc255	3a0b234669	Add context management to the MiroThinker template (simulate official agent behavior) (#1143 )	2026-01-13 18:08:59 +02:00
firecoperana	672df48ed1	server: keep logit bias unchanged when client does not set it (#1144 ) Co-authored-by: firecoperana <firecoperana>	2026-01-13 18:08:09 +02:00
Kawrakow	0adff91363	Make adding tensor overrides to llama-bench table optional (#1141 )	2026-01-13 11:08:13 +02:00
Kawrakow	9d9ed6a032	Add -sas, --scheduler-async to llama-bench (#1140 )	2026-01-13 10:23:50 +02:00
hksdpc255	e1c4c4a495	Fix Anthropic Messages API (#1136 ) * server: stop processing the prompt when client disconnects implement generator-based API for task results Update httplib.h to 0.27.0 Fix embedding error Stop prompt processing when disconnected * Port upstream https://github.com/ggml-org/llama.cpp/pull/18551 * add back anthropic * Fix merge issue caused by github webui --------- Co-authored-by: firecoperana <firecoperana>	2026-01-13 08:37:29 +02:00
Kawrakow	013831bba5	Fix compilation errors	2026-01-13 08:12:49 +02:00
Kawrakow	978202a754	Merge ffn_up and ffn_gate experts tensors (part 2) (#1139 ) * Add ability to merge up+gate exps to more models * We need to of course pass the merged tensor to build_ffn * All the others * Also Qwen3VL-MoE --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-13 08:07:52 +02:00
hksdpc255	54a1f68d32	Add chat parser for MiroThinker (#1138 ) * Add chat parser for MiroThinker * Add MiroThinker template * delete space	2026-01-13 08:07:12 +02:00
firecoperana	1a461525d5	server: stop processing the prompt when client disconnects (#1134 ) implement generator-based API for task results Update httplib.h to 0.27.0 Fix embedding error Stop prompt processing when disconnected Co-authored-by: firecoperana <firecoperana>	2026-01-13 07:56:59 +02:00
Kawrakow	d3e3ad40f9	Compiler warning and white space	2026-01-12 19:06:17 +02:00
Kawrakow	c03c2d7cc6	Merge ffn_up and ffn_gate experts tensors (#1137 ) * WIP - not working * WIP - not working * WIP - GPT-OSS working However, extremely stupid. The only way I could correctly repack the up/gate experts is to copy up and gate into host buffers, repack into another host buffer, copy back into the ffn_up_gate_exps tensor. This is going to be very slow for giant 500 GB models. My attempts to do this via a compute graph on the backend holding the tensors was unsuccessful. For GPT-OSS-20B I see ~6-7% better PP when using the original ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when using the small batch size implementation. Other models are not working yet on CUDA as I need to fix the fused mul-unary implementation. * WIP * WIP - Qwen3-MoE (and hopefully all others) working But when I say here and in the previous commit "working", I mean PP is working. TG is still broken. * WIP: TG seems to be working * Minor * Add command line option to merge experts up/gate * Add merge up/gate command line parameter to llama-bench * Turn off merge_up_gate_exps if split mode graph It is not yet implemented * When no bias, allow merging up/gate with tensor overrides * Arghh, we need to increase the context size again * Cleanup	2026-01-12 18:30:53 +02:00
bndlfm	bf0c6c57bb	addOpenGLRunpath -> autoAddDriverRunpath in .devops/nix/package.nix (#1135 )	2026-01-12 15:16:37 +02:00
Kawrakow	738dc60b78	We don't need these	2026-01-10 15:32:21 +00:00
Kawrakow	c7348f6f55	Fix mla = 0 (#1130 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-10 10:34:30 +02:00
Kawrakow	c7dba35702	Update AUTHORS (#1129 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-10 08:10:21 +02:00
firecoperana	c03ee1a4d2	server: improve speed of speculative decoding (#1119 ) * server: improve speed of speculative decoding change logs rpc: add recompute spec dec fix * Fix n_batch_size not set to context size for draft model --------- Co-authored-by: firecoperana <firecoperana>	2026-01-10 08:01:22 +02:00
dungquixote42	52ad1c6421	Implement Adaptive-P Sampler (#1100 ) * initial implementation of adaptive-p sampler * explicitly mark candidates unsorted + cleanup qualifiers * cosmetic update * reorg prototypes * lockstep with mainline * add _impl for _init + reorg * add LLAMA_API to prototypes * update sharpness to 10 * lockstep: rng seed * delete llama_sampling member in llama_sampler_adaptive_p * fix LLAMA_API return type * lockstep: rng seed cont * actually correct implementation * lockstep: sorting behavior * const -> constexpr for known constants * add missing space * fix softmax usage in adaptive p sampler * cosmetic changes * implement do-not-sort version of softmax * simpify rng seed, add static to constexpr * refactor: remove iface + use shared rng + use actually original probabilities * adaptive-p: add dedicated rng back in * fix initial max_logit + add float vector to adaptive p sampler context + stochastic sampling * adaptive-p: fuse first softmax with transformation * adaptive-p: implement binary search selection * adaptive-p: update comment	2026-01-10 07:58:53 +02:00
Kawrakow	dd3c3f72f2	Fix split mode graph for GPT-OSS with partial offload (#1128 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-10 07:57:43 +02:00
Kawrakow	08a0da389c	Better VRAM utilization strategy for split mode graph (#1126 ) * Better VRAM utilization strategy for split mode graph * Fix assert when --max-gpu is less than available GPUs --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-09 13:36:02 +02:00
Kawrakow	8725d110d2	Fix data races in the reduce op (#1124 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-09 10:34:58 +02:00
Kawrakow	eaf2e1c15a	Split mode "graph" for Ernie-4.5-MoE (#1121 ) * Ernie-4.5-MoE split mode graph * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-08 16:46:41 +02:00
Kawrakow	0456aa47d3	Do not abort on NCCL initizalization failure (#1120 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-08 09:19:50 +02:00
Kawrakow	5ef98f8b0f	Split mode "graph" for GPT-OSS (#1118 ) * Split mode "graph" for GPT-OSS * Force split_mode_f16 to false --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-08 09:14:15 +02:00
firecoperana	9c1bef35e8	CUDA: compress-mode size (#1110 ) Co-authored-by: firecoperana <firecoperana>	2026-01-07 18:33:17 +02:00
Kawrakow	99fbd84971	Split mode "graph" for Hunyuan-MoE (#1116 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-07 13:38:08 +02:00
Kawrakow	ab1616767b	Enable up to 4 GPUs for Mimo2-Flash (#1115 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-07 09:40:29 +02:00
Kawrakow	a82dcbf3ee	Fix ring reduction (#1114 ) * Fix ring reduction * Actually enable it --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-07 08:01:31 +02:00
Kawrakow	54a513768c	Disable ring reduction for now (#1112 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-06 15:40:50 +02:00
Kawrakow	3c99284b67	Split mode 'graph' fpr Qwen3-VL (#1107 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-05 17:32:00 +02:00
Kawrakow	218dcc5727	Split mode graph for Qwen3 (#1106 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-05 14:31:36 +02:00
Kawrakow	419a397ce0	Graph parallel for Mimo-V2-Flash (#1105 ) * WIP * Cleanup * Set max_gpu to 2 for Mimo2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-05 09:58:54 +02:00
Kawrakow	385fc14110	Fix race in CUDA FA for head sizes 192/128 (#1104 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-05 08:21:07 +02:00
Kawrakow	ab50c6cdcb	Mimo-V2-Flash support (#1096 ) * Mimo-2 support * Fix bug for head sizes not being the same It still does not solve the Mimo-2 quantized cache issue. * Fix quantized cache * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-01-05 08:00:01 +02:00
firecoperana	56dceefd6b	Fix windows build with CUDA (#1101 ) Co-authored-by: firecoperana <firecoperana>	2026-01-05 07:59:23 +02:00
hksdpc255	d7476a1b46	fix grammar for Kimi-K2 (#1103 ) * Update key-value separator and value end format * Sample grammar first if resampling --------- Co-authored-by: firecoperana <firecoperana>	2026-01-05 07:57:25 +02:00
Kawrakow	17a5a80946	Fix Windows build (#1097 )	2025-12-29 14:18:27 +01:00
Kawrakow	f878adbe90	Turn on graph reuse by default (#1094 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-27 08:27:16 +01:00
Kawrakow	519405dc97	Async compute graph evaluation (2 or more GPUs) (#1089 ) * WIP: absorb adding input into std_attn and std_ffn * WIP: NCCL infra * WIP: add reduce and fake_cpy ops * WIP * WIP: graph appears to work, layer is broken * WIP: Qwen3-MoE works with graph, layer still broken * WIP: GLM-4.5 graph works * WIP: fix sm layer (dense) * WIP: fix sm layer (MoE) * WIP: fast PP with bespoke 4-GPU NCCL I guess, I'm not using NCCL the right way as PP is very low with a single communicator group for 3 or more GPUs. But if I create 4 communicator groups for pairs of GPUs (0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting 1500 t/s for L3-70B on the 4x3090 system, which is ~20% better than the previous sm graph without NCCL. But that cannot be the solution (I cannot be creating pairwise communicators and associated logic for every possible number of GPUs). * WIP: Cohere2 * Explicitely set device * Bespoke 3-GPU case * WIP * Do not repeat get_rows multiple times * Fix 3 GPUs * OK, let's leave it in * Simple async * This sync seems enough * Only do async for 4 or more backends With 2 GPUs (so, 3 backends) not using async is slightly faster * Scheduler changes * Use OpenMP if available Surprisingly (at least to me), this is quite a bit faster than std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now at 105 t/s at zero context! * Do not use OpenMP if there are tensor overrides * Set omp max active levels * Be more careful with having set the device before using a stream * Command line option to turn on async. Set to false by defualt for now --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-27 08:18:06 +01:00
Kawrakow	7146de451d	Be more careful with having set the device before using a stream (#1093 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-26 19:19:41 +01:00
Kawrakow	8687fca3ff	Graph parallel: better PP performance for 3 and more GPUs (#1092 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-26 17:35:27 +01:00
Kawrakow	a2ffceb235	Fix split mode graph when p2p is not enabled (#1091 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-25 08:55:08 +01:00
Kawrakow	3be3649db9	Reduce add improvemens without NCCL (#1088 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-25 08:44:24 +01:00
Kawrakow	ada5cc1523	Fused norm (#1086 ) * Adding fused_norm - same idea as fused_rms_norm * Avoid computing the attention reduce op for cohere2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-24 15:22:43 +01:00
Kawrakow	5e64235d4c	Be able to set reduce op data type for split mode "graph" (#1087 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-24 14:01:29 +01:00
firecoperana	903377bc34	Webui: improve scroll and bug fixes (#1082 ) * Webui: fix message scroll back due to setPending smooth scroll remove throttle increase scroll margin # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html # examples/server/webui/src/utils/app.context.tsx * webui: don't scroll to bottom when conversation changes or edit message # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html * Webui: fix save config error # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html * Webui: add api key to request model name # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html * Update * webui: fix loading dots display issue # Conflicts: # examples/server/public/index.html.gz # examples/server/webui/dist/index.html # examples/server/webui/src/components/ChatMessage.tsx * Webui: cancel scroll when user moves up --------- Co-authored-by: firecoperana <firecoperana>	2025-12-24 12:30:26 +01:00
Kawrakow	0d7eb34185	Graph parallel: the next generation (#1080 ) * WIP: absorb adding input into std_attn and std_ffn * WIP: NCCL infra * WIP: add reduce and fake_cpy ops * WIP * WIP: graph appears to work, layer is broken * WIP: Qwen3-MoE works with graph, layer still broken * WIP: GLM-4.5 graph works * WIP: fix sm layer (dense) * WIP: fix sm layer (MoE) * WIP: fast PP with bespoke 4-GPU NCCL I guess, I'm not using NCCL the right way as PP is very low with a single communicator group for 3 or more GPUs. But if I create 4 communicator groups for pairs of GPUs (0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting 1500 t/s for L3-70B on the 4x3090 system, which is ~20% better than the previous sm graph without NCCL. But that cannot be the solution (I cannot be creating pairwise communicators and associated logic for every possible number of GPUs). * WIP: Cohere2 * Explicitely set device * Bespoke 3-GPU case * WIP * Do not repeat get_rows multiple times * Fix 3 GPUs * OK, let's leave it in * Implement the reduce op without NCCL available * Be able to build without NCCL cmake -DGGML_NCCL=OFF disables it * Make --max-gpu work again * Slightly better for 4 GPUs without NCCL * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-12-24 08:31:48 +01:00
firecoperana	2a633c4357	server: exclude thinking tokens when finding the slot (#1079 ) refactor find slot enable by default Fix load prompt rename variables Co-authored-by: firecoperana <firecoperana>	2025-12-22 09:46:45 +01:00

1 2 3 4 5 ...

4122 Commits