* spec : add self speculative decoding and ngram-mod and refactor
common : use common_ prefix for common library function
llama : use LLAMA_TOKEN_NULL
spec : add self speculative decoding (no draft model required) + refactor
spec : add ngram-mod
spec : various improvements to ngram-map + docs
spec : fix the check-rate logic of ngram-simple
common : add common_speculative_is_compat()
spec : simplify time measurement using common_time_meas
refactor common_sampler_init
refactor common_token_to_piece
refactor and fix cur_p bug
clean up
* spec : remove check rate
* spec : show warnings instead of abort
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>
* WIP
* This works but is slow
* Turn off the up / gate clamps for now
* OK we need the clamping
* Fuse the clamp (CUDA)
* Fuse the clamp (CPU)
* WIP
* Be able to use merged q, k, v
* Be able to use merged up/gate experts
* Fuse the clamp (CUDA mmvq)
* WIP: graph parallel for Step-3.5
* WIP
* This should be it
* Cleanup
* Fix merge
* Not working attempt to extend fused_mul_unary to the Step-3.5 case
* It works now, but performance gain is very minor
* Fix graph parallel when ngl < n_layers
* Fix using ffn_norm
When using graph parallel with ngl < n_layers, the ffn_norm tensor
may end up being split while the ffn tensors are on the CPU.
In that case we get a crash because we attempt to use the not-split
buffer of ffn_norm, which is invalid. This commit fixes that.
* Cleanup
This implements the ability to load, unload, and scale control vectors
(representation engineering) mid-inference, following the existing
task-queue pattern used by LoRA adapters.
New Endpoints:
- GET /control-vectors
- POST /control-vectors/load
- POST /control-vectors/unload
- POST /control-vectors/apply (handles scaling)
Technical Notes:
- Centralizes vector aggregation logic to share implementation between
load, unload, and apply tasks.
- Vectors are applied globally to the model context.
- Enforces dimension validation on load to safely reject incompatible
vectors.
Co-authored-by: Gapeleon <gapeleon@users.noreply.github.com>
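Since the endpoints follow the existing LoRA task-queue pattern, a client drives them with plain JSON requests. A hedged sketch of building the body for POST /control-vectors/apply — the field names (`id`, `scale`) are assumptions about the request schema, not confirmed from the server code:

```python
import json

def control_vector_apply_body(vectors):
    """Build the JSON body for POST /control-vectors/apply.

    `vectors` is a list of (vector_id, scale) pairs; each entry pairs a
    previously loaded vector with the scale to apply. Field names here
    are illustrative assumptions, not the server's documented schema."""
    return json.dumps({
        "control_vectors": [
            {"id": vid, "scale": float(scale)} for vid, scale in vectors
        ]
    })
```

Scaling through the apply endpoint, rather than reloading, matches the note above that aggregation logic is centralized across load, unload, and apply tasks.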
* adaptive_p: fix history update + use current probability for high temp
* adaptive_p: fix history update bug, update with current probability if temp is high
* replace temp-as-signal with server argument
* adaptive_p: rename ema_w_cur_p to updt_w_cur
* delete test code
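The adaptive_p fixes above concern how the sampler's probability history is updated with the current token probability, weighted by `updt_w_cur` (renamed from `ema_w_cur_p`). A minimal sketch of that kind of exponential-moving-average update — the formula illustrates the general EMA pattern and is an assumption, not the exact sampler code:

```python
def update_history(history, cur_p, updt_w_cur=0.1):
    """Blend the current token probability into the running history:
    history <- (1 - w) * history + w * cur_p
    where w (updt_w_cur) controls how fast the history tracks cur_p."""
    return (1.0 - updt_w_cur) * history + updt_w_cur * cur_p
```

With `updt_w_cur = 0` the history is frozen; with `updt_w_cur = 1` it tracks the current probability exactly, which is the behavior the fix enables when the update weight is driven by a server argument instead of temperature.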