Commit Graph

4172 Commits

usrlocalben
e5622a2e91 Fix Phi-3, Phi-4 (#1226)
* fix phi3 tensor setup

* avoid SWA for Phi-4
2026-02-04 11:57:50 +02:00
Kawrakow
f8acfc2bf0 Better CUDA TG for GQA = 10 (#1221)
* Better CUDA TG for GQA = 10

* Cleanup
2026-02-03 09:18:46 +02:00
firecoperana
7e8d444033 llama : add token matching support to llama-grammar (#1220)
* llama : add token matching support to llama-grammar

llama : add token matching support to llama-grammar (#17816)

common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342)

* disable tests and fix warnings

---------

Co-authored-by: firecoperana <firecoperana>
2026-02-03 07:57:17 +02:00
saood06
8ba7e2b40c Add support for Seed-OSS (#1218)
* it compiles

* Fix constants.py
2026-02-03 07:39:45 +02:00
dungquixote42
b86d8024a5 Adaptive p: history update fix + temp as flag (#1213)
* adaptive_p: fix history update + use current probability for high temp

* adaptive_p: fix history update bug, update with current probability if temp is high

* replace temp-as-signal with server argument

* adaptive_p: rename ema_w_cur_p to updt_w_cur

* delete test code
2026-02-03 07:36:12 +02:00
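The adaptive_p entries above describe blending the current token probability into a running history with an explicit weight (the renamed updt_w_cur) supplied as a server argument rather than inferred from the temperature. A minimal C++ sketch of that kind of exponential-moving-average update, with hypothetical names that are not the repo's actual API:

```cpp
// Illustrative only: an EMA-style history update where the blend weight
// (`updt_w_cur`, after the renamed parameter mentioned above) is passed in
// explicitly, as if it came from a server argument. Names are hypothetical.
#include <cstdio>

struct adaptive_p_history {
    float value = 0.0f;   // running estimate of the selected-token probability
};

// Blend the current token probability into the history with weight updt_w_cur.
static void adaptive_p_update(adaptive_p_history & hist, float cur_p, float updt_w_cur) {
    hist.value = (1.0f - updt_w_cur) * hist.value + updt_w_cur * cur_p;
}

int main() {
    adaptive_p_history hist;
    const float updt_w_cur = 0.25f;                 // would come from a server argument
    const float samples[]  = {0.9f, 0.4f, 0.7f};    // per-token probabilities
    for (float cur_p : samples) {
        adaptive_p_update(hist, cur_p, updt_w_cur);
        printf("history = %.3f\n", hist.value);
    }
    return 0;
}
```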
Kawrakow
589d80f677 Fix CPU FA work buffer size (#1216) 2026-02-02 12:39:41 +02:00
Kawrakow
49ba462f22 Merge pull request #1215 from ikawrakow/ik/cpu_fa_dont_repack_tg
Do not repack q8_0 for batch sizes less than 8
2026-02-02 12:12:34 +02:00
Kawrakow
d5498c4467 Do not repack q8_0 for batch sizes less than 8 2026-02-02 09:07:45 +00:00
Kawrakow
a527b5af25 Merge pull request #1212 from ikawrakow/ik/better_cpu_fa_thread_strategy
Better long-context CPU performance
2026-02-02 10:58:01 +02:00
Kawrakow
685df0e69d Work buffer size 2026-01-31 16:10:23 +00:00
Kawrakow
2bf2fa8ba4 Better CPU FA thread strategy 2026-01-31 15:46:16 +00:00
Kawrakow
33308908db Merge pull request #1211 from ikawrakow/ik/reduce_mla3_compute_buffer_size
Reduce CUDA compute buffer size for mla=3
2026-01-31 14:24:14 +02:00
Kawrakow
b85a2a50d5 Reduce compute buffer size for mla=3 2026-01-31 10:43:05 +00:00
Kawrakow
373f043d41 Merge pull request #1208 from ikawrakow/ik/try_fix_1201 2026-01-30 23:12:07 +02:00
Kawrakow
4d13ae03b5 Also these other two places 2026-01-30 15:36:29 +00:00
Kawrakow
098b1a2e04 Fix MiniMax-M2 KV-cache loading/saving 2026-01-30 13:38:07 +00:00
Kawrakow
811f8c3393 Fix bug in the CPU flash attention implementation (#1206) 2026-01-30 11:37:34 +02:00
Kawrakow
686fd1ebec Use standard output calculation for MiniMax-M2 graph parallel (#1199) 2026-01-29 09:06:40 +02:00
Kawrakow
f0c61adacc Allow setting the FA offset via command line argument (#1198) 2026-01-29 08:56:47 +02:00
Kawrakow
02ae22388f Apply offset to KQ_max in CUDA flash attention (#1196)
* Apply offset to KQ_max in CUDA flash attention

* Forgot to add to fattn-common.h
2026-01-29 07:27:53 +02:00
Kawrakow
68ed62447c Split mode graph for Minimax-M2 (#1195)
* Split mode graph for Minimax-M2

* Cleanup

* Forgotten ffn_exp_probs_b
2026-01-29 07:27:06 +02:00
Kawrakow
68cd52e583 Much faster long context TG for Minimax-M2 (#1194) 2026-01-28 10:43:11 +02:00
Kawrakow
f9b5420e6a Much faster long-context TG for GLM-4.5/4.6/4.7/AIR (#1193)
* This seems much better for GQA = 12 TG

* Remove unused arguments
2026-01-28 10:27:14 +02:00
Kawrakow
69fdd041c1 Remove forgotten unused code 2026-01-26 12:54:21 +00:00
Kawrakow
65441c2385 Even better GLM-4.7-Flash long context TG performance (#1192)
* Better FA for GLM-4.7-Flash

* Adjust ncols for ADA_LOVELACE or better
2026-01-26 13:45:06 +02:00
Kawrakow
30381fc1fc Faster hybrid inference with shared experts (#1191) 2026-01-26 07:22:05 +02:00
Kawrakow
478b56871f Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (part 2) (#1190)
* This works

* Make quantized KV cache work

* Remove the glm45 graph building changes

* Add condition
2026-01-26 07:21:47 +02:00
Kawrakow
28f8320f3a Much faster rng sampling (#1187) 2026-01-25 09:11:27 +02:00
Kawrakow
04beeffa4e Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (#1183)
* Similar hack to #1182 for GLM-4.5/6/7

* Refinements

* Disable when the KV cache is not f16
2026-01-24 09:39:29 +02:00
Kawrakow
f0fb76da64 Better GLM-4.7-Flash long context TG performance (#1182)
* Better GLM-4.7-Flash long context TG performance

* Handle quantized cache
2026-01-24 07:05:48 +02:00
Kawrakow
2a7cc09149 Remove llamafile remnants (#1179) 2026-01-22 13:20:23 +02:00
Kawrakow
66caa42b53 Fix build with GGML_CUDA_GRAPHS=OFF 2026-01-22 10:46:57 +00:00
Kawrakow
851fda3509 Split mode graph: use CUDA graphs (#1177)
* Use CUDA graphs also when there are tensor overrides

* Change graph key

* This seems to work
2026-01-22 12:38:36 +02:00
Kawrakow
573e23679d sweep_bench: set number of repetitions (#1176) 2026-01-22 12:28:30 +02:00
Kawrakow
101fe54797 CUDA graphs with tensor overrides (#1172)
* Use CUDA graphs also when there are tensor overrides

* Change graph key
2026-01-22 12:28:11 +02:00
Kawrakow
1cb8cd534f Fix build failure when OpenMP is not available (#1171) 2026-01-22 12:26:23 +02:00
Kawrakow
77c18acc90 Fix non-contiguous batched cuBLAS (#1178) 2026-01-22 12:25:05 +02:00
Kawrakow
987651e54c Make comments more precise when experts gating function is missing (#1175) 2026-01-21 09:12:40 +02:00
Kawrakow
9e07839ba3 Correct GLM-4.7-Flash gating function (#1174)
* Correct GLM-4.7-Flash gating function

* This is better
2026-01-21 07:53:18 +02:00
Kawrakow
6f1a69352f Fuse experts bias in top_k_moe kernel (#1170)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0

* Fuse bias in top_k_moe kernel if present
2026-01-20 15:38:51 +02:00
Kawrakow
996e77047a Avoid ggml_get_rows if not necessary (#1160)
* Copy reduce result to other GPUs if necessary

* Avoid ggml_get_rows for TG

* For the output ops use the result of the split that ran on the main GPU

* More models
2026-01-20 15:38:21 +02:00
Kawrakow
132a01d25d GLM-4.7-Flash support (#1168)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0
2026-01-20 12:46:52 +02:00
Kawrakow
ef5f17940c sampling: refactor sorting (#1166)
* sampling: refactor sorting

* Couldn't look at it without fixing it.
2026-01-19 16:48:54 +02:00
Kawrakow
98b30e5e81 Faster adaptive_p sampling (#1165)
* A hopefully more efficient adaptive_p sampling

* While at it, let's fix the formatting too

* More formatting

* Hopefully better

* This should be better

* Correctly accumulate adaptive_p sampling time

* AVX2
2026-01-19 16:03:09 +02:00
Kawrakow
fa58c20c42 A hopefully more efficient adaptive_p sampling (#1161)
* A hopefully more efficient adaptive_p sampling

* While at it, let's fix the formatting too

* More formatting

* Correctly accumulate sampling time for adaptive_p
2026-01-19 15:01:55 +02:00
Kawrakow
6a5c180be9 Fix bf16 additions on CUDA arch < Ampere (#1164)
* Fix bf16 additions on CUDA arch < Ampere

* Prevent using NCCL if graph reduce type is bf16 and arch < AMPERE
2026-01-19 12:27:52 +02:00
Kawrakow
0c0b6e4b8b Copy reduce result to other GPUs if necessary (#1156) 2026-01-19 08:40:26 +02:00
dungquixote42
6dfbef27ec Adaptive p: bugfix + optimization + refactor (#1155)
* adaptive-p sampler: fix zeroed orig_probs bug and refactor

- Fix bug where original probabilities were captured as zero by calculating
  them from logits in llama_prep_adaptive_p (new).
- Replace vector with unordered_map to track candidate probabilities,
  filtering for relevance via logit delta (16.6f).
- Standardize API naming: llama_<action/verb>_<focus/name/topic>_<extra/info>
- Update function signatures to follow most other samplers.

* resolve merge bug

* adaptive-p: revert reordering function definitions
2026-01-18 08:26:06 +02:00
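The commit body above describes recomputing candidate probabilities from the logits and tracking them in an unordered_map, treating candidates more than a fixed logit delta (16.6f) below the maximum as irrelevant. A minimal C++ sketch of that filtering idea, using hypothetical names rather than the repo's actual API:

```cpp
// Illustrative only: compute probabilities from the logits (instead of reading
// already-zeroed values) and keep only candidates whose logit is within a
// fixed delta (16.6f) of the maximum, stored in an unordered_map keyed by
// token id. All names are hypothetical.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <unordered_map>
#include <vector>

static std::unordered_map<int, float> collect_relevant_probs(const std::vector<float> & logits) {
    constexpr float k_logit_delta = 16.6f;   // relevance cutoff from the commit message

    float max_logit = -INFINITY;
    for (float l : logits) max_logit = std::max(max_logit, l);

    // Softmax over the retained candidates only.
    std::unordered_map<int, float> probs;
    float sum = 0.0f;
    for (int id = 0; id < (int) logits.size(); ++id) {
        if (max_logit - logits[id] > k_logit_delta) continue;   // too far below the max: drop it
        const float w = std::exp(logits[id] - max_logit);
        probs[id] = w;
        sum += w;
    }
    for (auto & kv : probs) kv.second /= sum;   // normalize to probabilities
    return probs;
}

int main() {
    const auto probs = collect_relevant_probs({5.0f, 3.0f, -20.0f, 4.5f});
    for (const auto & kv : probs) printf("token %d -> p = %.4f\n", kv.first, kv.second);
    return 0;
}
```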
firecoperana
d71a3ec315 Server: refactor and rename functions (#1151)
* Server: rename functions and refactor code

rename functions

refactor update slots

rename params_base

rename timings

* change

* Revert kv cache name changes

* Revert 2

* fix test build error

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
Kawrakow
7024fdbc72 Additional graph reduce types for split mode graph (#1154)
* WIP: add Q8_0 and BF16 as possible reduce types

Does not work - there is a bug somewhere

* This finally works
2026-01-18 08:02:49 +02:00