Commit Graph

4163 Commits

Kawrakow
685df0e69d Work buffer size 2026-01-31 16:10:23 +00:00
Kawrakow
2bf2fa8ba4 Better CPU FA thread strategy 2026-01-31 15:46:16 +00:00
Kawrakow
33308908db Merge pull request #1211 from ikawrakow/ik/reduce_mla3_compute_buffer_size
Reduce CUDA compute buffer size for mla=3
2026-01-31 14:24:14 +02:00
Kawrakow
b85a2a50d5 Reduce compute buffer size for mla=3 2026-01-31 10:43:05 +00:00
Kawrakow
373f043d41 Merge pull request #1208 from ikawrakow/ik/try_fix_1201 2026-01-30 23:12:07 +02:00
Kawrakow
4d13ae03b5 Also these other two places 2026-01-30 15:36:29 +00:00
Kawrakow
098b1a2e04 Fix MiniMax-M2 KV-cache loading/saving 2026-01-30 13:38:07 +00:00
Kawrakow
811f8c3393 Fix bug in the CPU flash attention implementation (#1206) 2026-01-30 11:37:34 +02:00
Kawrakow
686fd1ebec Use standard output calculation for MiniMax-M2 graph parallel (#1199) 2026-01-29 09:06:40 +02:00
Kawrakow
f0c61adacc Be able to set FA offset via command line argument (#1198) 2026-01-29 08:56:47 +02:00
Kawrakow
02ae22388f Apply offset to KQ_max in CUDA flash attention (#1196)
* Apply offset to KQ_max in CUDA flash attention

* Forgot to add to fattn-common.h
2026-01-29 07:27:53 +02:00
Kawrakow
68ed62447c Split mode graph for Minimax-M2 (#1195)
* Split mode graph for Minimax-M2

* Cleanup

* Forgotten ffn_exp_probs_b
2026-01-29 07:27:06 +02:00
Kawrakow
68cd52e583 Much faster long context TG for Minimax-M2 (#1194) 2026-01-28 10:43:11 +02:00
Kawrakow
f9b5420e6a Much faster long-context TG for GLM-4.5/4.6/4.7/AIR (#1193)
* This seems much better for GQA = 12 TG

* Remove unused arguments
2026-01-28 10:27:14 +02:00
Kawrakow
69fdd041c1 Remove forgotten unused code 2026-01-26 12:54:21 +00:00
Kawrakow
65441c2385 Even better GLM-4.7-Flash long context TG performance (#1192)
* Better FA for GLM-4.7-Flash

* Adjust ncols for ADA_LOVELACE or better
2026-01-26 13:45:06 +02:00
Kawrakow
30381fc1fc Faster hybrid inference when shared experts are present (#1191) 2026-01-26 07:22:05 +02:00
Kawrakow
478b56871f Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (part 2) (#1190)
* This works

* Make quantized KV cache work

* Remove the glm45 graph building changes

* Add condition
2026-01-26 07:21:47 +02:00
Kawrakow
28f8320f3a Much faster rng sampling (#1187) 2026-01-25 09:11:27 +02:00
Kawrakow
04beeffa4e Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (#1183)
* Similar hack to #1182 for GLM-4.5/6/7

* Refinements

* Disable when the KV cache is not f16
2026-01-24 09:39:29 +02:00
Kawrakow
f0fb76da64 Better GLM-4.7-Flash long context TG performance (#1182)
* Better GLM-4.7-Flash long context TG performance

* Handle quantized cache
2026-01-24 07:05:48 +02:00
Kawrakow
2a7cc09149 Remove llamafile remnants (#1179) 2026-01-22 13:20:23 +02:00
Kawrakow
66caa42b53 Fix build with GGML_CUDA_GRAPHS=OFF 2026-01-22 10:46:57 +00:00
Kawrakow
851fda3509 Split mode graph: use CUDA graphs (#1177)
* Use CUDA graphs also when there are tensor overrides

* Change graph key

* This seems to work
2026-01-22 12:38:36 +02:00
Kawrakow
573e23679d sweep_bench: set number of repetitions (#1176) 2026-01-22 12:28:30 +02:00
Kawrakow
101fe54797 CUDA graphs with tensor overrides (#1172)
* Use CUDA graphs also when there are tensor overrides

* Change graph key
2026-01-22 12:28:11 +02:00
Kawrakow
1cb8cd534f Fix build failure when OpenMP is not available (#1171) 2026-01-22 12:26:23 +02:00
Kawrakow
77c18acc90 Fix non-contiguous batched cuBLAS (#1178) 2026-01-22 12:25:05 +02:00
Kawrakow
987651e54c Make comments more precise when the experts gating function is missing (#1175) 2026-01-21 09:12:40 +02:00
Kawrakow
9e07839ba3 Correct GLM-4.7-Flash gating function (#1174)
* Correct GLM-4.7-Flash gating function

* This is better
2026-01-21 07:53:18 +02:00
Kawrakow
6f1a69352f Fuse experts bias in top_k_moe kernel (#1170)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0

* Fuse bias in top_k_moe kernel if present
2026-01-20 15:38:51 +02:00
Kawrakow
996e77047a Avoid ggml_get_rows if not necessary (#1160)
* Copy reduce result to other GPUs if necessary

* Avoid ggml_get_rows for TG

* For the output ops use the result of the split that ran on the main GPU

* More models
2026-01-20 15:38:21 +02:00
Kawrakow
132a01d25d GLM-4.7-Flash support (#1168)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0
2026-01-20 12:46:52 +02:00
Kawrakow
ef5f17940c sampling: refactor sorting (#1166)
* sampling: refactor sorting

* Couldn't look at it without fixing it.
2026-01-19 16:48:54 +02:00
Kawrakow
98b30e5e81 Faster adaptive_p sampling (#1165)
* A hopefully more efficient adaptive_p sampling

* While at it, let's fix the formatting too

* More formatting

* Hopefully better

* This should be better

* Correctly accumulate adaptive_p sampling time

* AVX2
2026-01-19 16:03:09 +02:00
Kawrakow
fa58c20c42 A hopefully more efficient adaptive_p sampling (#1161)
* A hopefully more efficient adaptive_p sampling

* While at it, let's fix the formatting too

* More formatting

* Correctly accumulate sampling time for adaptive_p
2026-01-19 15:01:55 +02:00
Kawrakow
6a5c180be9 Fix bf16 additions on CUDA arch < Ampere (#1164)
* Fix bf16 additions on CUDA arch < Ampere

* Prevent using NCCL if graph reduce type is bf16 and arch < AMPERE
2026-01-19 12:27:52 +02:00
Kawrakow
0c0b6e4b8b Copy reduce result to other GPUs if necessary (#1156) 2026-01-19 08:40:26 +02:00
dungquixote42
6dfbef27ec Adaptive p: bugfix + optimization + refactor (#1155)
* adaptive-p sampler: fix zeroed orig_probs bug and refactor

- Fix bug where original probabilities were captured as zero by calculating
  them from logits in llama_prep_adaptive_p (new).
- Replace vector with unordered_map to track candidate probabilities,
  filtering for relevance via logit delta (16.6f).
- Standardize API naming: llama_<action/verb>_<focus/name/topic>_<extra/info>
- Update function signatures to follow most other samplers.

* resolve merge bug

* adaptive-p: revert reordering function definitions
2026-01-18 08:26:06 +02:00
firecoperana
d71a3ec315 Server: refactor and rename functions (#1151)
* Server: rename functions and refactor code

rename functions

refactor update slots

rename params_base

rename timings

* change

* Revert kv cache name changes

* Revert 2

* fix test build error

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
Kawrakow
7024fdbc72 Additional graph reduce types for split mode graph (#1154)
* WIP: add Q8_0 and BF16 as possible reduce types

Does not work - there is a bug somewhere

* This finally works
2026-01-18 08:02:49 +02:00
firecoperana
ee463b079e Webui: add text completions and adaptive_p sampling (#1153)
* Webui: add text completions and adaptive_p sampling

* update description

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-17 08:37:07 +02:00
Kawrakow
709e1a5375 Fixing split mode graph with many GPUs (#1152)
* Attempt to fix the many GPU issue in split mode graph

* WIP: this seems more stable

Still hanging after a while if I try to use all 7 GPUs

* Reenable OpenMP in scheduler async

Seems solid up to 4 GPUs. It did hang with --max-gpu 6.

* printf cleanup
2026-01-17 08:05:24 +02:00
Kawrakow
cb1063f6cd Fix experts/shared experts split (#1147) 2026-01-14 15:35:16 +02:00
hksdpc255
3a0b234669 Add context management to the MiroThinker template (simulate official agent behavior) (#1143) 2026-01-13 18:08:59 +02:00
firecoperana
672df48ed1 server: keep logit bias unchanged when client does not set it (#1144)
Co-authored-by: firecoperana <firecoperana>
2026-01-13 18:08:09 +02:00
Kawrakow
0adff91363 Make adding tensor overrides to llama-bench table optional (#1141) 2026-01-13 11:08:13 +02:00
Kawrakow
9d9ed6a032 Add -sas, --scheduler-async to llama-bench (#1140) 2026-01-13 10:23:50 +02:00
hksdpc255
e1c4c4a495 Fix Anthropic Messages API (#1136)
* server: stop processing the prompt when client disconnects

implement generator-based API for task results

Update httplib.h to 0.27.0

Fix embedding error

Stop prompt processing when disconnected

* Port upstream https://github.com/ggml-org/llama.cpp/pull/18551

* add back anthropic

* Fix merge issue caused by github webui

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-13 08:37:29 +02:00
Kawrakow
013831bba5 Fix compilation errors 2026-01-13 08:12:49 +02:00