ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-22 22:24:11 +00:00

Author	SHA1	Message	Date
Kawrakow	cbf7fc7e2f	Update README with warning about '_XL' models from Unsloth Added important note regarding quantized models from Unsloth.	2026-02-22 07:42:17 +01:00
Kawrakow	bd387a279a	Add new authors to the AUTHORS file	2026-02-21 19:20:31 +01:00
firecoperana	66323b92f7	Qwen3.5-MoE: fix regenerating message error (#1295 ) Co-authored-by: firecoperana <firecoperana>	2026-02-21 18:24:12 +01:00
Kawrakow	13c3d83ce7	Qwen3.5-MoE support (#1288 ) * WIP: loads and runs, but not correct Very high PPL, empty TG. * This appears to work	2026-02-21 08:33:06 +01:00
mcm007	b2cb4512c5	Create parameters overview (#1269 ) * raw parameters.md * fix small typos in common.cpp * Update build args in parameters.md * Update parameters.md - format as table - sections * Update README.md - quickstart - build and run * Update parameters.md other tools examples * add PR links * multiple updates to parameters.md - description - add jargon section - add suggestions from feedbacks * don't imply that only linux is supported in README.md * add alias to parameters.md * Update README.md with recent models and features * Update parameters.md with latest features * address suggestions - no-ooae - placeholder for common commands - no-kv-offload - llama-sweep-bench - placeholder for unique parameters * specify Linux distro in README.md	2026-02-20 07:20:56 +01:00
dungquixote42	0f411b02e2	Fix adaptive p sampler bug with string ban (#1287 ) * adaptive p: update internal state only if not rewinding * adaptive p: conditional update for speculative decoding * adaptive p: refactor to rewind instead of update * adaptive p fix: better comments * fix rewind check * add record to handle multi-token rewind * better comment	2026-02-20 07:11:36 +01:00
rkozuch	b855bf92de	Fix slot prompt updating. (#1285 ) Co-authored-by: Rkozuch <you@example.com>	2026-02-19 08:15:49 +01:00
Kawrakow	d81cde5cea	Fix very low bpw missing imatrix check (#1284 )	2026-02-19 08:15:26 +01:00
Samuel Oliveira Alves	51df09be8a	Feat - add kimi 2.5 Vision (#1280 ) * port kimi 25-vision from upstream * feat(clip): add support for Kimi K2.5 vision model	2026-02-19 08:15:03 +01:00
Kawrakow	04cf685e82	Factor out delta net (#1286 ) * WIP: factor out delta net implementation * WIP * Use the standard FFN functions * More standard attn for Qwen3-Next	2026-02-18 17:16:17 +01:00
Kawrakow	d2d65c0d64	Better CPU performance for Qwen3-Next (#1283 ) * Better CPU silu - +4% PP * Improve ggml_compute_forward_dup_bytes	2026-02-18 15:55:11 +01:00
Kawrakow	84831fc3ee	Don't disable CUDA graphs for Qwen3-Next (#1278 )	2026-02-18 08:47:45 +01:00
Kawrakow	cafeef484c	More Qwen3-Next optimizations (#1277 ) * Optimizing q3next TG * Fused add -> softplus -> mul on CUDA * Remove forgotten debug log * Increase ggml context size Required for Qwen3-Next with batch/u-batch size of 4096 * WIP * Avoid some contiguous ops * Avoid some repeats * Avoid some more repeats	2026-02-17 16:03:51 +01:00
Samuel Oliveira Alves	88f98c891d	server: add string ban in speculative path (#1274 )	2026-02-17 12:33:28 +01:00
Kawrakow	16fe459a49	Faster CPU PP performance for Qwen3-Next - optimize concat (#1276 )	2026-02-17 11:46:27 +01:00
Kawrakow	35c99f9f41	Faster Qwen3-Next PP on CUDA - optimize concat (#1275 )	2026-02-16 11:46:39 +01:00
Kawrakow	97e7c091cd	Update AUTHORS file with new contributors Added new contributors to the AUTHORS file.	2026-02-16 07:13:25 +01:00
firecoperana	868ac2128e	fix build error (#1272 ) Co-authored-by: firecoperana <firecoperana>	2026-02-16 06:51:03 +01:00
Kawrakow	e30198a553	WIP: Qwen3Next (#1266 ) * qwen3next: add architecture support and recurrent-state fixes * qwen3next: optimize broadcast sub and single-seq ssm conv * cuda: build MoE row mapping on device in mul_mat_id * cuda: add guarded multi-seq fast path for ssm_conv * docs: update qwen3next perf report for cuda MoE/SSM tuning * cuda: reduce qwen3next moe/ssm sync overhead and refresh eval * qwen3next: split cpu/cuda eval builds and tune PP scheduling * qwen3next: harden seq-state flow and support optional dense FFN layers * qwen3next: trim delta-net graph overhead in chunking path * qwen3next: remove redundant v_conv cont in delta path * qwen3next: avoid extra cont on linear attention output * qwen3next: drop redundant cont before recurrent state flatten * qwen3next: keep recurrent state in 4d layout through delta path * qwen3next: add fused delta-net op and wire model path * tests: add backend-op coverage for ggml_delta_net * qwen3next: add runtime switch for fused delta-net path * docs: refresh qwen3next perf review and benchmark matrix * qwen3next: default fused delta-net off and document quality checks * qwen3next: add decode-only fused delta mode * qwen3next: make fused delta safe by default and fix fused tensor layout * qwen3next: warn when forcing fused decode mode * qwen3next: add fused-delta regression runner script * qwen3next: integrate fused regression into eval harness * qwen3next: clean up chunked delta-net shape handling * qwen3next: add absolute sanity guards to fused regression * qwen3next: add unified regression runner script * qwen3next: disable flash-attn for cpu-only contexts * docs: reconcile qwen3next status and remaining upstream gaps * common: add qwen3next fused-delta runtime flag * cuda: add qwen3next delta-net kernel dispatch override * docs: update qwen3next quality and serving baseline findings * qwen3next: keep fused delta on safe path and remove PR artifacts * qwen3next: align autoregressive delta-net decode layout * Revert "qwen3next: align autoregressive delta-net decode layout" This reverts commit `9241164a5e`. * cuda: port solve-tri fast-paths for qwen3next delta-net * qwen3next: add fused-delta runtime flag and drop env toggle * qwen3next: make fused delta single-flag and default on * Account for GPU arch differences * Revert "cuda: build MoE row mapping on device in mul_mat_id" This reverts commit `89e9ecfa84`. * qwen3next: drop non-essential MoE scheduling and split heuristics * qwen3next: avoid generic ggml_sub broadcast changes * llama: restore only_active_experts log message * Remove unnecessary hacks, disable fusion for now. * qwen3next: port hybrid recurrent state memory semantics * qwen3next: clean up recurrent state slot plumbing * qwen3next: fix hybrid V-cache layout plumbing * qwen3next: guard recurrent state slots against kv capacity * qwen3next: persist recurrent state in session data - serialize/restore qwen3next cache.s_l in state/session paths - bump session and sequence-state file versions for format change - fallback to single-token chunking for mixed repeated seq_id batches * qwen3next: drop unused fused-delta builder path - remove dead build_delta_net_fused lambda - remove unused llm_build_context::fused_delta member * qwen3next: remove unused fused-delta CLI/context plumbing - drop -fd/-no-fd options and related YAML dump field - remove fused_delta fields from public/internal context params - remove fused_delta assignment and logging in context init * ggml: remove unused DELTA_NET operator stack * Missing include * Reorder ops/unary ops So we don't change again the enum values of the mul mat ops * Minor * Discard unnecessary changes in llama-build-context.cpp * Minor * Revert "Discard unnecessary changes in llama-build-context.cpp" This reverts commit `edadb80ed6`. * Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches * Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next * Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next It was single-threaded and was taking ~25% of the computation time during TG. It is now down to 2%. Strangely enough, I measure 13.6 t/s with llama-bench, but if I let the model give me an actual response with llama-cli, I get close to 17 t/s. * Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next For Qwen3Next there is a scale op on a largish tensor (548k elements) that has a single row for TG, so was done in a single thread. We now simply use blocks of 1024 elements. * Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next * CPU: fuse transpose -> cont -> sum_rows -> transpose: 17.6 -> 23.1 t/s for Qwen3Next * Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512 * Multithreading for OP_SUB * Don't commit with timing trace on * Multithread neg and sigmoid * Be able to turn on/off fusion more easily (CPU) * Name the mul_mat ops so we know where the time goes * WIP * Much better PP on CUDA * CUDA: fuse transpose -> cont -> sum_rows -> transpose Needs non-contiguous variant of sum_rows. On the CPU this gave 30+% improvement in TG performance, on CUDA it is a disappointing 6-7%. I guess, this is because Georgi's cont CPU implementation was so bad that skipping it made such a big difference. * CUDA: faster mul for special case relevant for Qwen3Next Worth 1% in TG * Fix CPU OP_CONT --------- Co-authored-by: yurko <yurko@local> Co-authored-by: Yurko <yurko@example.com> Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net> Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>	2026-02-16 06:50:28 +01:00
Kawrakow	528cadb07b	GLM-5 support (#1268 )	2026-02-15 07:49:44 +01:00
mcm007	f5fe33b7a9	Update README.md (#1263 ) * Update README.md Add new models and few of the features, quants and improvements * Update README.md ministral3 and split mode "graph"	2026-02-14 09:02:33 +01:00
mcm007	f80505911d	Improve README.md (#1260 )	2026-02-14 09:01:52 +01:00
RodriMora	102f77b7d3	server: add /v1/responses support (#1184 ) * server: add /v1/responses support * server: fix Responses API model fallback and SSE branching	2026-02-14 08:30:18 +01:00
firecoperana	1cb7e1bf39	spec : add self speculative decoding, ngram and refactor (#1261 ) * spec : add self speculative decoding and ngram-mod and refactor common : use common_ prefix for common library function llama : use LLAMA_TOKEN_NULL spec : add self speculative decoding (no draft model required) + refactor spec : add ngram-mod spec : various improvements to ngram-map + docs spec : fix the check-rate logic of ngram-simple common : add common_speculative_is_compat() spec : simplify time measurement using common_time_meas refactor common_sampler_init refactor common_token_to_piece refactor and fix cur_p bug clean up * spec : remove check rate * spec: show warnings instead of abort --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>	2026-02-13 19:04:55 +01:00
Kawrakow	1fdbc0dafe	Fix #1222 (#1257 ) * Fix #1222 * Typo	2026-02-09 16:20:16 +01:00
Kawrakow	494d70626f	Allow missing rope_frequency_base_swa in Step-3.5 models	2026-02-08 08:59:39 +00:00
Kawrakow	e22b2d1246	Be able to read uint32_t and bool arrays from GGUFs (#1252 )	2026-02-07 19:20:15 +02:00
firecoperana	f1ccf340dd	fix model name missing in final response (#1250 ) Co-authored-by: firecoperana <firecoperana>	2026-02-07 18:31:39 +02:00
mcm007	dbcbfdb0ef	Ik llama swap in container step by step guide (#1249 ) * Create README.md * Add container files and llama-swap configs * Update main README.md * Build without GGML_IQK_FA_ALL_QUANTS Otherwise fails with CUDA_DOCKER_ARCH=default * Mention GGML_IQK_FA_ALL_QUANTS usage * First step more explicit	2026-02-07 18:30:19 +02:00
Kawrakow	82c4f27332	Fuse the attention gate in Step-3.5-Flash (#1244 ) * WIP * This works but is slow * Turn off the up / gate clamps for now * OK we need the clamping * Fuse the clamp (CUDA) * Fuse the clamp (CPU) * WIP * Be able to use merged q, k, v * Be able to use merged up/gate experts * Fuse the clamp (CUDA mmvq) * WIP: graph parallel for Step-3.5 * WIP * This should be it * Cleanup * Fix merge * Not working attempt to extend fused_mul_unary to the Step-3.5 case * It works now, but performance gain is very minor	2026-02-07 07:56:58 +02:00
Kawrakow	90d7499c2c	Step-3.5: llama.cpp compatibility changes (#1240 ) * Step-3.5: llama.cpp compatibility changes * Also read rope_freq_base_train_swa from the GGUF	2026-02-07 07:56:11 +02:00
Kawrakow	c5d74f66e2	Fix graph parallel when ngl < n_layers (#1241 ) * Fix graph parallel when ngl < n_layers * Fix using ffn_norm When using graph parallel with ngl < n_layers, the ffn_norm tensor may have ended up being split, while the ffn tensors are on the CPU. In that case we will get a crash because we attempt to use the not-split buffer of ffn_norm, which is invalid. This commit fixes that. * Cleanup	2026-02-06 11:48:24 +02:00
Kawrakow	4d86907b18	Remove forgotten printf	2026-02-06 07:43:18 +00:00
Kawrakow	81ea911f0d	Graph parallel for Step-3.5-Flash (#1236 ) * WIP * This works but is slow * Turn off the up / gate clamps for now * OK we need the clamping * Fuse the clamp (CUDA) * Fuse the clamp (CPU) * WIP * Be able to use merged q, k, v * Be able to use merged up/gate experts * Fuse the clamp (CUDA mmvq) * WIP: graph parallel for Step-3.5 * WIP * This should be it * Cleanup * Fix merge	2026-02-06 06:56:51 +02:00
Kawrakow	5a44324e4a	Bespoke ggml_repeat for Step3.5-Flash (#1239 )	2026-02-06 06:56:09 +02:00
Kawrakow	1ec12b8e3b	Fix #1237 (#1238 )	2026-02-05 18:30:18 +02:00
Kawrakow	51fc78750f	Remove forgotten printf	2026-02-05 16:52:24 +02:00
Kawrakow	a7befb3bed	Change default FA offset to ln(2) (#1235 ) * Change default FA offset to ln(2) * Also here	2026-02-05 13:42:53 +02:00
Kawrakow	9c1c74acda	Step-3.5-Flash support (#1231 ) * WIP * This works but is slow * Turn off the up / gate clamps for now * OK we need the clamping * Fuse the clamp (CUDA) * Fuse the clamp (CPU) * WIP * Be able to use merged q, k, v * Be able to use merged up/gate experts * Fuse the clamp (CUDA mmvq)	2026-02-05 08:13:22 +02:00
firecoperana	8d952ff183	Server: add string ban (#1185 ) * server: add string ban * increase rewind limit * init n_buffer --------- Co-authored-by: firecoperana <firecoperana>	2026-02-05 08:12:34 +02:00
Michael Militzer	a335cff664	Fix llama-server-cuda Dockerfile to build ik_llama.cpp correctly (#1224 ) Co-authored-by: Michael Militzer <michael@xvid.com>	2026-02-04 16:08:00 +02:00
Kawrakow	b41b8cf813	Graph parallel for SEED-OSS (#1222 ) * Graph parallel for SEED-OSS * Cleanup	2026-02-04 16:07:43 +02:00
gapeleon	17d101863d	server: add dynamic control vector management endpoints (#1223 ) This implements the ability to load, unload, and scale control vectors (representation engineering) mid-inference, following the existing task-queue pattern used by LoRA adapters. New Endpoints: - GET /control-vectors - POST /control-vectors/load - POST /control-vectors/unload - POST /control-vectors/apply (handles scaling) Technical Notes: - Centralizes vector aggregation logic to share implementation between load, unload, and apply tasks. - Vectors are applied globally to the model context. - Enforces dimension validation on load to safely reject incompatible vectors. Co-authored-by: Gapeleon <gapeleon@users.noreply.github.com>	2026-02-04 16:07:18 +02:00
usrlocalben	e5622a2e91	Fix Phi-3, Phi-4 (#1226 ) * fix phi3 tensor setup * avoid SWA for Phi-4	2026-02-04 11:57:50 +02:00
Kawrakow	f8acfc2bf0	Better CUDA TG for GQA = 10 (#1221 ) * Better CUDA TG for GQA = 10 * Cleanup	2026-02-03 09:18:46 +02:00
firecoperana	7e8d444033	llama : add token matching support to llama-grammar (#1220 ) * llama : add token matching support to llama-grammar llama : add token matching support to llama-grammar (#17816) common/grammar : replace problematic backtracking regex `[\s\S]` (#18342) disable tests and fix warnings --------- Co-authored-by: firecoperana <firecoperana>	2026-02-03 07:57:17 +02:00
saood06	8ba7e2b40c	Add support for Seed-OSS (#1218 ) * it compiles * Fix constants.py	2026-02-03 07:39:45 +02:00
dungquixote42	b86d8024a5	Adaptive p: history update fix + temp as flag (#1213 ) * adaptive_p: fix history update + use current probability for high temp * adaptive_p: fix history update bug, update with current probability if temp is high * replace temp-as-signal with server argument * adaptive_p: rename ema_w_cur_p to updt_w_cur * delete test code	2026-02-03 07:36:12 +02:00
Kawrakow	589d80f677	Fix CPU FA work buffer size (#1216 )	2026-02-02 12:39:41 +02:00
Kawrakow	49ba462f22	Merge pull request #1215 from ikawrakow/ik/cpu_fa_dont_repack_tg Do not repack q8_0 for batch sizes less than 8	2026-02-02 12:12:34 +02:00


4215 Commits